Posted to issues@spark.apache.org by "Nickolay (Jira)" <ji...@apache.org> on 2022/04/07 09:54:00 UTC

[jira] [Created] (SPARK-38816) Wrong code results in random matrix generator in spark-als algorithm

Nickolay created SPARK-38816:
--------------------------------

             Summary: Wrong code results in random matrix generator in spark-als algorithm 
                 Key: SPARK-38816
                 URL: https://issues.apache.org/jira/browse/SPARK-38816
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 3.2.1, 3.1.2, 3.1.1
            Reporter: Nickolay


In the Spark ALS algorithm, we need to initialize nonnegative factor matrices for users and items.

In ALS:

 
{code:scala}
private def initialize[ID](
    inBlocks: RDD[(Int, InBlock[ID])],
    rank: Int,
    seed: Long): RDD[(Int, FactorBlock)] = {
  // Choose a unit vector uniformly at random from the unit sphere, but from the
  // "first quadrant" where all elements are nonnegative. This can be done by choosing
  // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
  // This appears to create factorizations that have a slightly better reconstruction
  // (<1%) compared picking elements uniformly at random in [0,1].
  inBlocks.mapPartitions({ iter =>
    iter.map {
      case (srcBlockId, inBlock) =>
        val random: XORShiftRandom = new XORShiftRandom(byteswap64(seed ^ srcBlockId))
        val factors: Array[Array[Float]] = Array.fill(inBlock.srcIds.length) {
          val factor = Array.fill(rank)(random.nextGaussian().toFloat)
          val nrm: Float = blas.snrm2(rank, factor, 1)
          blas.sscal(rank, 1.0f / nrm, factor, 1)
          factor
        }
        (srcBlockId, factors)
    }
  }, preservesPartitioning = true)
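}
{code}
The comment says that each factor vector is drawn from the "first quadrant", where all elements are nonnegative, by sampling Normal(0,1) values and taking the absolute value before normalizing. The code, however, fills each vector with random.nextGaussian().toFloat and never takes the absolute value. According to its documentation, nextGaussian returns values with mean 0.0, so it also returns negative numbers: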

 
{code:java}
/**
 * @return the next pseudorandom, Gaussian ("normally") distributed
 *         {@code double} value with mean {@code 0.0} and
 *         standard deviation {@code 1.0} from this random number
 *         generator's sequence
 */
synchronized public double nextGaussian() {
    // See Knuth, ACP, Section 3.4.1 Algorithm C.
    if (haveNextNextGaussian) {
        haveNextNextGaussian = false;
        return nextNextGaussian;
    } else {
        double v1, v2, s;
        do {
            v1 = 2 * nextDouble() - 1; // between -1 and 1
            v2 = 2 * nextDouble() - 1; // between -1 and 1
            s = v1 * v1 + v2 * v2;
        } while (s >= 1 || s == 0);
        double multiplier = StrictMath.sqrt(-2 * StrictMath.log(s)/s);
        nextNextGaussian = v2 * multiplier;
        haveNextNextGaussian = true;
        return v1 * multiplier;
    }
}
 {code}
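The sign behavior is easy to check empirically. The following is a minimal sketch, not Spark code: the class name GaussianSignCheck, the seed, and the draw count are illustrative assumptions, and java.util.Random stands in for Spark's XORShiftRandom, whose nextGaussian is the method documented above.

```java
import java.util.Random;

// Illustrative helper, not part of Spark: counts how many draws from
// nextGaussian() are negative for a given seed.
final class GaussianSignCheck {

    // Draws `draws` Gaussian values and returns how many were below zero.
    static int countNegatives(long seed, int draws) {
        Random random = new Random(seed);
        int negatives = 0;
        for (int i = 0; i < draws; i++) {
            if (random.nextGaussian() < 0.0) {
                negatives++;
            }
        }
        return negatives;
    }

    public static void main(String[] args) {
        int draws = 10_000; // arbitrary sample size for the demonstration
        int negatives = countNegatives(42L, draws);
        System.out.println(negatives + " of " + draws + " draws were negative");
    }
}
```

Because the distribution is symmetric around 0.0, roughly half of the draws should come out negative.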
 

The result is a factor matrix that contains negative values, contrary to what the comment describes.
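For comparison, the initialization that the comment describes (Normal(0,1) draws, absolute value, then normalization) could look like the sketch below. It is written in plain Java under stated assumptions rather than as a patch to Spark: the class NonnegativeUnitVector and its sample method are hypothetical names, java.util.Random replaces XORShiftRandom, and a hand-written loop replaces the BLAS calls.

```java
import java.util.Random;

// Hypothetical sketch, not Spark's implementation: samples a unit vector
// whose elements are all nonnegative.
final class NonnegativeUnitVector {

    // Assumes rank >= 1. Each element is |N(0,1)|, then the vector is scaled
    // to unit Euclidean norm.
    static float[] sample(Random random, int rank) {
        float[] factor = new float[rank];
        double sumSquares = 0.0;
        for (int i = 0; i < rank; i++) {
            // The absolute value is the step missing from the quoted code.
            float v = (float) Math.abs(random.nextGaussian());
            factor[i] = v;
            sumSquares += (double) v * v;
        }
        float norm = (float) Math.sqrt(sumSquares);
        for (int i = 0; i < rank; i++) {
            factor[i] /= norm;
        }
        return factor;
    }

    public static void main(String[] args) {
        float[] factor = sample(new Random(7L), 10);
        for (float v : factor) {
            System.out.println(v);
        }
    }
}
```

In the quoted Scala, one possible equivalent change would be to fill the array with math.abs(random.nextGaussian()).toFloat instead of random.nextGaussian().toFloat; this is a suggestion, not a confirmed fix.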


