Posted to commits@spark.apache.org by ml...@apache.org on 2017/03/02 13:51:10 UTC
spark git commit: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS

Repository: spark
Updated Branches:
  refs/heads/master 50c08e82f -> 9cca3dbf4


[SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS

[SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and adds code to the examples to illustrate its usage.

## How was this patch tested?

Doc and example change only. Built the HTML docs locally and verified that the example code builds and runs in the shell for Scala/Python.

Author: Nick Pentreath <ni...@za.ibm.com>

Closes #17102 from MLnick/SPARK-19345-coldstart-doc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9cca3dbf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9cca3dbf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9cca3dbf

Branch: refs/heads/master
Commit: 9cca3dbf4add9004a769dee1a556987e37230294
Parents: 50c08e8
Author: Nick Pentreath <ni...@za.ibm.com>
Authored: Thu Mar 2 15:51:16 2017 +0200
Committer: Nick Pentreath <ni...@za.ibm.com>
Committed: Thu Mar 2 15:51:16 2017 +0200

----------------------------------------------------------------------
 docs/ml-collaborative-filtering.md              | 28 ++++++++++++++++++++
 .../spark/examples/ml/JavaALSExample.java       |  2 ++
 examples/src/main/python/ml/als_example.py      |  4 ++-
 .../apache/spark/examples/ml/ALSExample.scala   |  2 ++
 4 files changed, 35 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/9cca3dbf/docs/ml-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
index cfe8351..58f2d4b 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -59,6 +59,34 @@ This approach is named "ALS-WR" and discussed in the paper
 It makes `regParam` less dependent on the scale of the dataset, so we can apply the
 best parameter learned from a sampled subset to the full dataset and expect similar performance.
 
+### Cold-start strategy
+
+When making predictions using an `ALSModel`, it is common to encounter users and/or items in the
+test dataset that were not present when the model was trained. This typically occurs in two
+scenarios:
+
+1. In production, for new users or items that have no rating history and on which the model has not 
+been trained (this is the "cold start problem").
+2. During cross-validation, the data is split between training and evaluation sets. When using 
+simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is actually 
+very common to encounter users and/or items in the evaluation set that are not in the training set.
+
+By default, Spark assigns `NaN` predictions during `ALSModel.transform` when a user and/or item 
+factor is not present in the model. This can be useful in a production system, since it indicates 
+a new user or item, and so the system can make a decision on some fallback to use as the prediction.
+
+However, this is undesirable during cross-validation, since any `NaN` predicted values will result
+in `NaN` results for the evaluation metric (for example when using `RegressionEvaluator`).
+This makes model selection impossible.
+
+Spark allows users to set the `coldStartStrategy` parameter
+to "drop" in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values. 
+The evaluation metric will then be computed over the non-`NaN` data and will be valid. 
+Usage of this parameter is illustrated in the example below.
+
+**Note:** currently the supported cold start strategies are "nan" (the default behavior mentioned 
+above) and "drop". Further strategies may be supported in future.
+
 **Examples**
 
 <div class="codetabs">
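
The effect described in the cold-start section above can be sketched without Spark. The snippet below is only an illustration in plain Python, not code from this commit or from the Spark API: the `rmse` helper and the `scored` sample data are hypothetical, and the filtering step merely imitates what setting `coldStartStrategy` to "drop" is documented to do (removing rows whose prediction is `NaN` before the metric is computed).

```python
import math


def rmse(pairs):
    """Root-mean-square error over (label, prediction) pairs (hypothetical helper)."""
    return math.sqrt(sum((label - pred) ** 2 for label, pred in pairs) / len(pairs))


# Hypothetical (label, prediction) pairs. The NaN stands in for a user or
# item that was absent from the training split.
scored = [(4.0, 3.5), (2.0, 2.5), (5.0, float("nan"))]

# Analogue of the default "nan" strategy: a single NaN prediction
# propagates through the sum and makes the whole metric NaN.
rmse_nan = rmse(scored)

# Analogue of the "drop" strategy: discard rows with NaN predictions,
# then evaluate on what remains.
kept = [(label, pred) for label, pred in scored if not math.isnan(pred)]
rmse_drop = rmse(kept)

print(math.isnan(rmse_nan))  # True
print(rmse_drop)             # 0.5
```

Note that the metric after dropping is computed over fewer rows, so it only describes users and items the model has seen.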

http://git-wip-us.apache.org/repos/asf/spark/blob/9cca3dbf/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java
----------------------------------------------------------------------
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java
index 33ba668..81970b7 100644
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java
@@ -103,6 +103,8 @@ public class JavaALSExample {
     ALSModel model = als.fit(training);
 
     // Evaluate the model by computing the RMSE on the test data
+    // Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    model.setColdStartStrategy("drop");
     Dataset<Row> predictions = model.transform(test);
 
     RegressionEvaluator evaluator = new RegressionEvaluator()

http://git-wip-us.apache.org/repos/asf/spark/blob/9cca3dbf/examples/src/main/python/ml/als_example.py
----------------------------------------------------------------------
diff --git a/examples/src/main/python/ml/als_example.py b/examples/src/main/python/ml/als_example.py
index 1a979ff..2e7214e 100644
--- a/examples/src/main/python/ml/als_example.py
+++ b/examples/src/main/python/ml/als_example.py
@@ -44,7 +44,9 @@ if __name__ == "__main__":
     (training, test) = ratings.randomSplit([0.8, 0.2])
 
     # Build the recommendation model using ALS on the training data
-    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
+    # Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
+              coldStartStrategy="drop")
     model = als.fit(training)
 
     # Evaluate the model by computing the RMSE on the test data

http://git-wip-us.apache.org/repos/asf/spark/blob/9cca3dbf/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala
----------------------------------------------------------------------
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala
index bb5d163..868f49b 100644
--- a/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala
@@ -65,6 +65,8 @@ object ALSExample {
     val model = als.fit(training)
 
     // Evaluate the model by computing the RMSE on the test data
+    // Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    model.setColdStartStrategy("drop")
     val predictions = model.transform(test)
 
     val evaluator = new RegressionEvaluator()

