Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/04/16 20:00:00 UTC

[jira] [Assigned] (SPARK-38816) Wrong comment in random matrix generator in spark-als algorithm

     [ https://issues.apache.org/jira/browse/SPARK-38816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38816:
------------------------------------

    Assignee:     (was: Apache Spark)

> Wrong comment in random matrix generator in spark-als algorithm 
> ----------------------------------------------------------------
>
>                 Key: SPARK-38816
>                 URL: https://issues.apache.org/jira/browse/SPARK-38816
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.1, 3.1.2, 3.2.1
>            Reporter: Nickolay
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Spark ALS algorithm, we need to initialize nonnegative factor matrices for users and items.
> In ALS:
>  
> {code:java}
> private def initialize[ID](
>     inBlocks: RDD[(Int, InBlock[ID])],
>     rank: Int,
>     seed: Long): RDD[(Int, FactorBlock)] = {
>   // Choose a unit vector uniformly at random from the unit sphere, but from the
>   // "first quadrant" where all elements are nonnegative. This can be done by choosing
>   // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
>   // This appears to create factorizations that have a slightly better reconstruction
>   // (<1%) compared picking elements uniformly at random in [0,1].
>   inBlocks.mapPartitions({ iter =>
>     iter.map {
>       case (srcBlockId, inBlock) =>
>         val random: XORShiftRandom = new XORShiftRandom(byteswap64(seed ^ srcBlockId))
>         val factors: Array[Array[Float]] = Array.fill(inBlock.srcIds.length) {
>           val factor = Array.fill(rank)(random.nextGaussian().toFloat)
>           val nrm: Float = blas.snrm2(rank, factor, 1)
>           blas.sscal(rank, 1.0f / nrm, factor, 1)
>           factor
>         }
>         (srcBlockId, factors)
>     }
>   }, preservesPartitioning = true)
> } {code}
> In the comment, the author says that the generated factors are nonnegative. But the code uses random.nextGaussian().toFloat without taking the absolute value, and the documentation of the nextGaussian method shows that it also returns negative numbers:
>  
> {code:java}
> /**
>  * @return the next pseudorandom, Gaussian ("normally") distributed
>  *         {@code double} value with mean {@code 0.0} and
>  *         standard deviation {@code 1.0} from this random number
>  *         generator's sequence
>  */
> synchronized public double nextGaussian() {
>     // See Knuth, ACP, Section 3.4.1 Algorithm C.
>     if (haveNextNextGaussian) {
>         haveNextNextGaussian = false;
>         return nextNextGaussian;
>     } else {
>         double v1, v2, s;
>         do {
>             v1 = 2 * nextDouble() - 1; // between -1 and 1
>             v2 = 2 * nextDouble() - 1; // between -1 and 1
>             s = v1 * v1 + v2 * v2;
>         } while (s >= 1 || s == 0);
>         double multiplier = StrictMath.sqrt(-2 * StrictMath.log(s)/s);
>         nextNextGaussian = v2 * multiplier;
>         haveNextNextGaussian = true;
>         return v1 * multiplier;
>     }
> }
>  {code}
>  
> As a result, the factor matrices contain negative values, which contradicts the comment.
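
The mismatch described in the report can be checked with a small standalone sketch. It is an illustration under stated assumptions, not Spark code: java.util.Random stands in for Spark's XORShiftRandom, plain loops stand in for the blas.snrm2 and blas.sscal calls, and the class and method names (FactorInitSketch, gaussianUnitVector, nonnegativeUnitVector, countNegatives) are hypothetical. The second method shows the absolute-value step that the comment describes; whether to change the code or the comment is left to the maintainers.

```java
import java.util.Random;

// Hypothetical demo class; not part of Spark.
class FactorInitSketch {

    // Mirrors the quoted initializer: Gaussian entries scaled to unit L2 norm.
    static float[] gaussianUnitVector(Random random, int rank) {
        float[] factor = new float[rank];
        double sumSq = 0.0;
        for (int i = 0; i < rank; i++) {
            factor[i] = (float) random.nextGaussian();
            sumSq += (double) factor[i] * factor[i];
        }
        float nrm = (float) Math.sqrt(sumSq);
        for (int i = 0; i < rank; i++) {
            factor[i] /= nrm;
        }
        return factor;
    }

    // Variant matching the comment: absolute values of the same kind of vector.
    // Taking absolute values does not change the L2 norm, so the result
    // is still a unit vector, now with no negative entries.
    static float[] nonnegativeUnitVector(Random random, int rank) {
        float[] factor = gaussianUnitVector(random, rank);
        for (int i = 0; i < rank; i++) {
            factor[i] = Math.abs(factor[i]);
        }
        return factor;
    }

    // Counts entries strictly below zero.
    static int countNegatives(float[] v) {
        int count = 0;
        for (float x : v) {
            if (x < 0.0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // arbitrary seed, chosen for the demo
        int rank = 10;
        int rows = 100;
        int negAsWritten = 0;
        int negWithAbs = 0;
        for (int i = 0; i < rows; i++) {
            negAsWritten += countNegatives(gaussianUnitVector(random, rank));
            negWithAbs += countNegatives(nonnegativeUnitVector(random, rank));
        }
        System.out.println("negative entries, as written: " + negAsWritten);
        System.out.println("negative entries, with abs:   " + negWithAbs);
    }
}
```

Running main prints the two counts. The first is expected to be well above zero, since roughly half of all Gaussian draws are negative, while the second is zero by construction.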



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
