Posted to user@spark.apache.org by chris snow <ch...@gmail.com> on 2017/03/26 22:48:58 UTC

Collaborative filtering steps in spark

In the paper “Large-Scale Parallel Collaborative Filtering for the
Netflix Prize”, the following steps are described for ALS:

Step 1 Initialize matrix M by assigning the average rating for that
movie as the first row, and small random numbers for the remaining entries.
Step 2 Fix M, Solve U by minimizing the objective function (the sum of
squared errors);
Step 3 Fix U, solve M by minimizing the objective function similarly;
Step 4 Repeat Steps 2 and 3 until a stopping criterion is satisfied.
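
For concreteness, here's a toy in-memory sketch of how I read those four
steps, using breeze for the linear algebra (my own illustration, not Spark's
implementation - the function name and the fixed iteration count standing in
for the stopping criterion are my assumptions):

import breeze.linalg._

// Toy, in-memory version of the paper's four steps (weighted-lambda ALS).
// r is users x movies, with 0.0 meaning "unrated". All names are mine, not Spark's.
def alsSketch(r: DenseMatrix[Double], rank: Int, lambda: Double, iters: Int)
    : (DenseMatrix[Double], DenseMatrix[Double]) = {
  val (nU, nM) = (r.rows, r.cols)

  // Step 1: small random entries everywhere, then overwrite the first row of M
  // with each movie's average observed rating.
  val m = DenseMatrix.rand(rank, nM) * 0.01
  for (j <- 0 until nM) {
    val seen = (0 until nU).map(i => r(i, j)).filter(_ > 0)
    if (seen.nonEmpty) m(0, j) = seen.sum / seen.size
  }
  val u = DenseMatrix.zeros[Double](rank, nU)
  val eye = DenseMatrix.eye[Double](rank)

  // Regularized least squares for one factor column, over the ratings it has.
  def solveCol(other: DenseMatrix[Double], obs: Seq[(Int, Double)]): DenseVector[Double] = {
    val idx = obs.map(_._1)
    val sub = DenseMatrix.tabulate(rank, idx.length)((k, n) => other(k, idx(n)))
    val a = sub * sub.t + eye * (lambda * idx.length)  // n_i * lambda regularization
    a \ (sub * DenseVector(obs.map(_._2).toArray))
  }

  for (_ <- 0 until iters) {   // Step 4: fixed iteration count as stopping criterion
    for (i <- 0 until nU) {    // Step 2: fix M, solve each user's column of U
      val obs = (0 until nM).collect { case j if r(i, j) > 0 => (j, r(i, j)) }
      if (obs.nonEmpty) u(::, i) := solveCol(m, obs)
    }
    for (j <- 0 until nM) {    // Step 3: fix U, solve each movie's column of M
      val obs = (0 until nU).collect { case i if r(i, j) > 0 => (i, r(i, j)) }
      if (obs.nonEmpty) m(::, j) := solveCol(u, obs)
    }
  }
  (u, m)
}

Step 1 is the part my question below is about: the first row of M is set to
each movie's average observed rating before iterating.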

Does Spark take the average rating for the movie as the first row?
I've looked through the source code, but I can't see the average
rating being calculated for the movie.

Many thanks,

Chris

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Collaborative filtering steps in spark

Posted by chris snow <ch...@gmail.com>.
Thanks Nick, that helps with my understanding of ALS.



Re: Collaborative filtering steps in spark

Posted by Nick Pentreath <ni...@gmail.com>.
No, it does a random initialization. It does use a slightly different
approach from a purely normal random draw - it chooses non-negative draws,
which empirically gives very slightly better results.

In practice I'm not sure if the average rating approach will make a big
difference (it's been a long while since I read the paper!)

Sean put the absolute-value init in originally, if I recall, so he may
have more context.

Though in fact looking at the code now, I see the comment still says that,
but I'm not convinced the code actually does it:

/**
 * Initializes factors randomly given the in-link blocks.
 *
 * @param inBlocks in-link blocks
 * @param rank rank
 * @return initialized factor blocks
 */
private def initialize[ID](
    inBlocks: RDD[(Int, InBlock[ID])],
    rank: Int,
    seed: Long): RDD[(Int, FactorBlock)] = {
  // Choose a unit vector uniformly at random from the unit sphere, but from the
  // "first quadrant" where all elements are nonnegative. This can be done by choosing
  // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
  // This appears to create factorizations that have a slightly better reconstruction
  // (<1%) compared picking elements uniformly at random in [0,1].
  inBlocks.map { case (srcBlockId, inBlock) =>
    val random = new XORShiftRandom(byteswap64(seed ^ srcBlockId))
    val factors = Array.fill(inBlock.srcIds.length) {
      val factor = Array.fill(rank)(random.nextGaussian().toFloat)
      val nrm = blas.snrm2(rank, factor, 1)
      blas.sscal(rank, 1.0f / nrm, factor, 1)
      factor
    }
    (srcBlockId, factors)
  }
}


factor is drawn elementwise from N(0, 1) and then scaled by its L2 norm, but
it looks to me like the absolute value is never taken before scaling, which
is what the comment indicates should happen...
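
For what it's worth, if the code were to match the comment, I'd expect the
draw itself to take the absolute value before normalizing - something like
this (my sketch of the change the comment implies, not what the source
currently does):

// Hypothetical: abs of each Gaussian draw, so every element is nonnegative
// and the normalized vector lands in the "first quadrant" of the unit sphere.
val factor = Array.fill(rank)(math.abs(random.nextGaussian()).toFloat)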

