You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/03/02 20:07:00 UTC

[jira] [Resolved] (SPARK-26162) ALS results vary with user or item ID encodings

     [ https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-26162.
-------------------------------
    Resolution: Not A Problem

I don't think it's a bug; the distributed computation and blocking might produce slightly different results depending on the order it encounters the vectors. Reopen if you have more detail that shows the results are meaningfully different.

> ALS results vary with user or item ID encodings
> -----------------------------------------------
>
>                 Key: SPARK-26162
>                 URL: https://issues.apache.org/jira/browse/SPARK-26162
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Pablo J. Villacorta
>            Priority: Major
>
> When calling ALS.fit() with the same seed on a dataset, the results (both the latent factors matrices and the accuracy of the recommendations) differ when we change the labels used to encode the users or items. The code example below illustrates this by just changing user ID 1 or 2 to an unused ID like 30. The user factors matrix changes, but not only the rows corresponding to users 1 or 2 but also the other rows. 
> Is this the intended behaviour?
> {code:java}
> val r = scala.util.Random
> r.setSeed(123456)
> val trainDataset1 = spark.sparkContext.parallelize(
>     (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // users go from 0 to 19
> ).toDF("user", "item", "rating")
> val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 30).otherwise(col("user")))
> val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 30).otherwise(col("user")))
> val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
>     _.groupBy("user").agg(collect_list("item").alias("watched"))
> )
> val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, trainDataset3).map(new ALS().setSeed(12345).fit(_))
> als1.userFactors.show(5, false)
> als2.userFactors.show(5, false)
> als3.userFactors.show(5, false){code}
> If we ask for recommendations and compare them with a test dataset also modified accordingly (in this example, the test dataset is exactly the train dataset) the results also differ:
> {code:java}
> val recommendations = Array(als1, als2, als3).map(x =>
>     x.recommendForAllUsers(20).map{
>         case Row(user: Int, recommendations: WrappedArray[Row]) => {
>             val items = recommendations.map{case Row(item: Int, score: Float) => item}
>             (user, items)
>         }
>     }.toDF("user", "recommendations")
> )
> val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
>     case (testDataset, recommendationsDF) =>
>         testDataset.join(recommendationsDF, "user")
>             .rdd.map(r => {
>             (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
>                 r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
>             )
>         })
> }
> val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))
> println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}")
> println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}")
> println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}")
> {code}
> EDIT: The results also change if we just swap the IDs of some users, like:
> {code:java}
> val trainDataset4 = trainDataset1.withColumn("user", when(col("user")===1, 4)
>                                                     .when(col("user")===4, 1).otherwise(col("user")))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org