Posted to reviews@spark.apache.org by sushmitkarar <gi...@git.apache.org> on 2016/05/24 05:32:32 UTC

[GitHub] spark pull request: Glrm

GitHub user sushmitkarar opened a pull request:

    https://github.com/apache/spark/pull/13274

    Glrm

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    
    
    (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rezazadeh/spark glrm

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13274.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13274
    
----
commit c7679f91bbf79bcefeb8c9f7ee968aac1f39b503
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-08-27T07:17:32Z

    First version of SparkGLRM

commit 1347655961e047488bcb7ceb753c16bb1c2d7e4a
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-08-27T07:19:02Z

    Documentation

commit 16ae855c6664c276a0b2ef5fbf3c625251c9a82c
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-07T01:20:54Z

    index bounds

commit aa24830dc22a1e95af6fea0282d31255fd335036
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-07T01:30:39Z

    More data

commit ee6cd5328458bd83d16f2f2e43a64fdac0b090f8
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-07T01:34:33Z

    Bigger dataset

commit be9a51b1cc77a8a546b8150dcd498cfaecb5f703
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-07T18:20:27Z

    Larger data

commit 99971db070d6923ca55148a1fcc9dc55ff068472
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-10T00:01:06Z

    Better random entry generation

commit 576d9ae365589d7e67cb697e6e7edbf7c70f1f0c
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-10T00:01:27Z

    Better parameters

commit 1e5afe8212257fa4d05cea06665979ff9b3a9cc7
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-10T00:02:35Z

    Better parameters

commit 04f48097a19de2857f49f162013fc22e217ab4eb
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-10T18:36:11Z

    Proper display of status

commit 7489302795e0787a70b885090603380d06d3f7a6
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-20T06:33:14Z

    chunking

commit 136d0310e5b5d2cb3341ea847b0a8fb989c21f77
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-21T06:34:02Z

    Pluggable proxes

commit 49c9ca72599a26d3ff91ce97739d9eec5bc24d8b
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-21T06:51:13Z

    Documentation

commit d8f07b4c66dce1fa0c7c3be4bfb978d62f63702b
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-22T05:15:55Z

    add documentation

commit 0e62894e10682e92c1d44375e3567697cf1c0056
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-22T05:18:40Z

    better spacing

commit d70cfe659a95c792cb234df05ed24fdcddcf44ad
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-25T22:19:17Z

    Better parameters

commit 2dae5b616604182b980978f5fb444d20f169b5eb
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-26T07:54:35Z

    Better loss grads pluggability

commit 8c9e977bac6f66dec6c4f3b1e55065807e75eb1b
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-26T18:04:58Z

    parmaeter changes

commit 5951d30c0aab9668be741d367ec7c0d57824a3d3
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-26T23:21:29Z

    better stepsizes and library of proxes

commit 2c3f75b30b00a6d6363e08c584017564b8c33a51
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-26T23:24:47Z

    better documentation

commit edae547949571a80a9a1cedba88c55e8f123a97c
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-26T23:29:28Z

    Better documentation

commit 6140f3f5aa202f6635f4dc07da8c9f790382968e
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-09-30T04:49:53Z

    Add funny loss

commit c1f2216c326b49b82703e01a20be95e718601f56
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-01T20:25:02Z

    Funny Loss example

commit 222e38dd40a12a3b6b9305609b8abd0ccdc61b8c
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-04T19:14:17Z

    New interface

commit 9be6c288795a5fe5e8a33afe8d1bb09174db9901
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-04T19:18:51Z

    Documentation

commit 643fd50f27c430c62a982f1ba38a3e190d097232
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-04T19:19:38Z

    Move to new directory

commit 20128d9e97e2ba8b19bfde3f57200d805f44a75e
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-04T19:21:49Z

    Readme first version

commit 51c4cb8e53a1549faed66f197a8821ca5618aa10
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-04T19:30:54Z

    Movement message

commit 9f2469d5d8073b3036ff7f712ab2d256b1fc72b6
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-04T19:50:03Z

    Initial README

commit 13693d09dd21c32c8c1a4047bc5021ed014db776
Author: Reza Zadeh <ri...@gmail.com>
Date:   2014-10-04T20:01:44Z

    Better readme

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Glrm

Posted by sushmitkarar <gi...@git.apache.org>.
Github user sushmitkarar closed the pull request at:

    https://github.com/apache/spark/pull/13274




[GitHub] spark pull request #13274: Glrm

Posted by Tagar <gi...@git.apache.org>.
Github user Tagar commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13274#discussion_r198663163
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/glrm/SparkGLRM.scala ---
    @@ -0,0 +1,223 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples
    +
    +import breeze.linalg.{DenseVector => BDV}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.broadcast.Broadcast
    +import org.apache.spark.mllib.linalg.distributed.MatrixEntry
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.{SparkConf, SparkContext}
    +
    +import scala.collection.BitSet
    +
    +/**
    + * Generalized Low Rank Models for Spark
    + *
    + * Run these commands from the spark root directory.
    + *
    + * Compile with:
    + * sbt/sbt assembly
    + *
    + * Run with:
    + * ./bin/spark-submit  --class org.apache.spark.examples.SparkGLRM  \
    + * ./examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar \
    + * --executor-memory 1G \
    + * --driver-memory 1G
    + */
    +
    +object SparkGLRM {
    +  /*********************************
    +   * GLRM: Bank of loss functions
    +   *********************************/
    +  def lossL2squaredGrad(i: Int, j: Int, prediction: Double, actual: Double): Double = {
    +    prediction - actual
    +  }
    +
    +  def lossL1Grad(i: Int, j: Int, prediction: Double, actual: Double): Double = {
    +    // a subgradient of L1
    +    math.signum(prediction - actual)
    +  }
    +
    +  def mixedLossGrad(i: Int, j: Int, prediction: Double, actual: Double): Double = {
    +    // weird loss function subgradient for demonstration
    +    if ((i + j) % 2 == 0) lossL1Grad(i, j, prediction, actual) else lossL2squaredGrad(i, j, prediction, actual)
    +  }
    +
    +  /***********************************
    +   * GLRM: Bank of prox functions
    +   **********************************/
    +  // L2 prox
    +  def proxL2(v:BDV[Double], stepSize:Double, regPen:Double): BDV[Double] = {
    +    val arr = v.toArray.map(x => x / (1.0 + stepSize * regPen))
    +    new BDV[Double](arr)
    +  }
    +
    +  // L1 prox
    +  def proxL1(v:BDV[Double], stepSize:Double, regPen:Double): BDV[Double] = {
    +    val sr = regPen * stepSize
    +    val arr = v.toArray.map(x =>
    +      if (math.abs(x) < sr) 0
    +      else if (x < -sr) x + sr
    +      else x - sr
    +    )
    +    new BDV[Double](arr)
    +  }
    +
    +  // Non-negative prox
    +  def proxNonneg(v:BDV[Double], stepSize:Double, regPen:Double): BDV[Double] = {
    +    val arr = v.toArray.map(x => math.max(x, 0))
    +    new BDV[Double](arr)
    +  }
    +
    +  /* End of GLRM library */
    +
    +
    +  // Helper functions for updating
    +  def computeLossGrads(ms: Broadcast[Array[BDV[Double]]], us: Broadcast[Array[BDV[Double]]],
    +                       R: RDD[(Int, Int, Double)],
    +                       lossGrad: (Int, Int, Double, Double) => Double) : RDD[(Int, Int, Double)] = {
    +    R.map { case (i, j, rij) => (i, j, lossGrad(i, j, ms.value(i).dot(us.value(j)), rij))}
    +  }
    +
    +  // Update factors
    +  def update(us: Broadcast[Array[BDV[Double]]], ms: Broadcast[Array[BDV[Double]]],
    +             lossGrads: RDD[(Int, Int, Double)], stepSize: Double,
    +             nnz: Array[Double],
    +             prox: (BDV[Double], Double, Double) => BDV[Double], regPen: Double)
    +  : Array[BDV[Double]] = {
    +    val rank = ms.value(0).length
    +    val ret = Array.fill(ms.value.size)(BDV.zeros[Double](rank))
    +
    +    val retu = lossGrads.map { case (i, j, lossij) => (i, us.value(j) * lossij) } // vector/scalar multiply
    +                .reduceByKey(_ + _).collect() // vector addition through breeze
    +
    +    for (entry <- retu) {
    +      val idx = entry._1
    +      val g = entry._2
    +      val alpha = (stepSize / (nnz(idx) + 1))
    +
    +      ret(idx) = prox(ms.value(idx) - g * alpha, alpha, regPen)
    +    }
    +
    +    ret
    +  }
    +
    +  def fitGLRM(R: RDD[(Int, Int, Double)], M:Int, U:Int,
    +              lossFunctionGrad: (Int, Int, Double, Double) => Double,
    +              moviesProx: (BDV[Double], Double, Double) => BDV[Double],
    +              usersProx: (BDV[Double], Double, Double) => BDV[Double],
    +              rank: Int,
    +              numIterations: Int,
    +              regPen: Double) : (Array[BDV[Double]], Array[BDV[Double]], Array[Double]) = {
    +    // Transpose data
    +    val RT = R.map { case (i, j, rij) => (j, i, rij) }.cache()
    +
    +    val sc = R.context
    +
    +    // Compute number of nonzeros per row and column
    +    val mCountRDD = R.map { case (i, j, rij) => (i, 1) }.reduceByKey(_ + _).collect()
    +    val mCount = Array.ofDim[Double](M)
    +    for (entry <- mCountRDD)
    +      mCount(entry._1) = entry._2
    +    val maxM = mCount.max
    +    val uCountRDD = R.map { case (i, j, rij) => (j, 1) }.reduceByKey(_ + _).collect()
    +    val uCount = Array.ofDim[Double](U)
    +    for (entry <- uCountRDD)
    +      uCount(entry._1) = entry._2
    +    val maxU = uCount.max
    +
    +    // Initialize m and u
    +    var ms = Array.fill(M)(BDV[Double](Array.tabulate(rank)(x => math.random / (M * U))))
    +    var us = Array.fill(U)(BDV[Double](Array.tabulate(rank)(x => math.random / (M * U))))
    +
    +    // Iteratively update movies then users
    +    var msb = sc.broadcast(ms)
    +    var usb = sc.broadcast(us)
    --- End diff --
    
    Does it make the assumption that both `ms` and `us` can fit on each of the executors?
    How well does it scale?
    Thanks!
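    The scaling question above can be made concrete with back-of-the-envelope
    arithmetic: because `fitGLRM` broadcasts both `ms` and `us`, every executor
    must hold both factor matrices in memory at once during an update. A rough
    sketch of the per-executor footprint (the object name and row counts below
    are illustrative assumptions, not figures from the PR):

    ```scala
    // Sketch: memory needed on each executor to hold a broadcast factor
    // matrix of `rows` dense vectors with `rank` doubles (8 bytes each).
    // Illustrative only; sizes are hypothetical, not taken from the PR.
    object BroadcastSizeSketch {
      def factorBytes(rows: Long, rank: Int): Long = rows * rank * 8L

      def main(args: Array[String]): Unit = {
        val rank = 50
        val mBytes = factorBytes(1000000L, rank)   // 1M row-side factors
        val uBytes = factorBytes(10000000L, rank)  // 10M column-side factors
        // Both broadcasts are resident on every executor simultaneously.
        val totalMB = (mBytes + uBytes) / (1024L * 1024L)
        println(s"per-executor broadcast footprint: ~$totalMB MB")
      }
    }
    ```

    At rank 50 even modest row counts reach multiple gigabytes per executor,
    which is why broadcast-based factorizations are usually limited to factor
    matrices that fit in memory, in contrast to block-partitioned approaches
    such as MLlib's ALS.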





[GitHub] spark issue #13274: Glrm

Posted by Tagar <gi...@git.apache.org>.
Github user Tagar commented on the issue:

    https://github.com/apache/spark/pull/13274
  
    @rezazadeh is there any plan to incorporate GLRM into core Spark? 
    It seems https://github.com/rezazadeh/spark/tree/glrm/examples/src/main/scala/org/apache/spark/examples/glrm hasn't had updates for several years; is GLRM for Spark maintained somewhere else? 
    Thanks.

