Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/02/28 18:53:42 UTC

[GitHub] spark pull request #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStra...

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/17102

    [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS

    [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and adds code to the examples to illustrate usage.
    
    ## How was this patch tested?
    
    Doc and example change only. Built the HTML doc locally and verified that the example code builds and runs in the shell for Scala/Python.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark SPARK-19345-coldstart-doc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17102.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17102
    
----
commit baba319fae615ffc1ebfe564f9ac520e701fdf20
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-02-28T14:34:40Z

    Initial cold start param doc for user guide

commit db919d59b93ea6f9b8b423ea11d3d9c99ce43454
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-02-28T14:34:47Z

    Merge remote-tracking branch 'apache-github/master' into SPARK-19345-coldstart-doc

commit cd923e2791692d9dd81d7186033a1bfe22aab80d
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-02-28T14:47:03Z

    Update examples

commit 4c2c78c82101a2aec8f7f0634781869e1b4d0184
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-02-28T14:57:28Z

    Clean up doc and add note about future strategies

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStra...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17102


[GitHub] spark pull request #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStra...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17102#discussion_r103757179
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -59,6 +59,34 @@ This approach is named "ALS-WR" and discussed in the paper
     It makes `regParam` less dependent on the scale of the dataset, so we can apply the
     best parameter learned from a sampled subset to the full dataset and expect similar performance.
     
    +### Cold-start strategy
    +
    +When making predictions using an `ALSModel`, it is common to encounter users and/or items in the 
    +test dataset that were not present during training the model. This typically occurs in two 
    +scenarios:
    +
    +1. In production, for new users or items that have no rating history and on which the model has not 
    +been trained (this is the "cold start problem")
    --- End diff --
    
    sure thing


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStra...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17102#discussion_r103757139
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -59,6 +59,34 @@ This approach is named "ALS-WR" and discussed in the paper
     It makes `regParam` less dependent on the scale of the dataset, so we can apply the
     best parameter learned from a sampled subset to the full dataset and expect similar performance.
     
    +### Cold-start strategy
    +
    +When making predictions using an `ALSModel`, it is common to encounter users and/or items in the 
    +test dataset that were not present during training the model. This typically occurs in two 
    +scenarios:
    +
    +1. In production, for new users or items that have no rating history and on which the model has not 
    +been trained (this is the "cold start problem")
    +2. During cross-validation, the data is split between training and evaluation sets. When using 
    +simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is actually 
    +very common to encounter users and/or items in the evaluation set that are not in the training set
    +
    +By default, Spark assigns `NaN` predictions during `ALSModel.transform` when a user and/or item 
    +factor is not present in the model. This can be useful in a production system, since it indicates 
    +a new user or item, and so the system can make a decision on some fallback to use as the prediction.
    +
    +However, this is undesirable during cross-validation, since any `NaN` predicted values will result
    +in `NaN` results for the evaluation metric (for example when using `RegressionEvaluator`).
    +This makes model selection impossible.
    +
    +Spark allows users to set the `coldStartStrategy` parameter
    +to `drop` in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values. 
    +The evaluation metric will then be computed over the non-`NaN` data and will be valid. 
    +Usage of this parameter is illustrated in the example below.
    +
    +**Note:** currently the supported cold start strategies are `nan` (the default behavior mentioned 
    --- End diff --
    
    Yeah here I wanted to explicitly mention the "drop" option. Ideally will remove this note section when further strategies are added (like the average user vector idea).
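
The effect described in the quoted doc text can be sketched without Spark. This is a minimal illustration in plain Python, not the `RegressionEvaluator` implementation: the `rmse` helper and the sample pairs are hypothetical, and the `NaN` stands in for a prediction on a user or item that was absent from training. It shows why a single `NaN` prediction makes the metric `NaN`, and why filtering those rows first (the behavior the "drop" strategy is described as providing) yields a usable value.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (label, prediction) pairs (hypothetical helper)."""
    pairs = list(pairs)
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))


# Hypothetical (label, prediction) pairs; the NaN models a cold-start row.
scored = [(4.0, 3.5), (2.0, 2.5), (5.0, float("nan"))]

# Keeping NaN rows: the NaN propagates through the sum, so the metric is NaN.
rmse_with_nan = rmse(scored)

# Dropping NaN rows before evaluation: the metric is computed over the rest.
kept = [(y, p) for y, p in scored if not math.isnan(p)]
rmse_after_drop = rmse(kept)

print(rmse_with_nan, rmse_after_drop)
```

Because the dropped rows no longer count towards the metric, the result describes only the users and items the model has seen.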


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    **[Test build #73599 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73599/testReport)** for PR 17102 at commit [`4c2c78c`](https://github.com/apache/spark/commit/4c2c78c82101a2aec8f7f0634781869e1b4d0184).


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    Merged to master


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73599/
    Test PASSed.


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    **[Test build #73599 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73599/testReport)** for PR 17102 at commit [`4c2c78c`](https://github.com/apache/spark/commit/4c2c78c82101a2aec8f7f0634781869e1b4d0184).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    **[Test build #73703 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73703/testReport)** for PR 17102 at commit [`c422d58`](https://github.com/apache/spark/commit/c422d5892fe3c8ed2fd8c4f3bf4978b9ced2bb02).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73703/
    Test PASSed.


[GitHub] spark pull request #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStra...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17102#discussion_r103731183
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -59,6 +59,34 @@ This approach is named "ALS-WR" and discussed in the paper
     It makes `regParam` less dependent on the scale of the dataset, so we can apply the
     best parameter learned from a sampled subset to the full dataset and expect similar performance.
     
    +### Cold-start strategy
    +
    +When making predictions using an `ALSModel`, it is common to encounter users and/or items in the 
    +test dataset that were not present during training the model. This typically occurs in two 
    +scenarios:
    +
    +1. In production, for new users or items that have no rating history and on which the model has not 
    +been trained (this is the "cold start problem")
    +2. During cross-validation, the data is split between training and evaluation sets. When using 
    +simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is actually 
    +very common to encounter users and/or items in the evaluation set that are not in the training set
    +
    +By default, Spark assigns `NaN` predictions during `ALSModel.transform` when a user and/or item 
    +factor is not present in the model. This can be useful in a production system, since it indicates 
    +a new user or item, and so the system can make a decision on some fallback to use as the prediction.
    +
    +However, this is undesirable during cross-validation, since any `NaN` predicted values will result
    +in `NaN` results for the evaluation metric (for example when using `RegressionEvaluator`).
    +This makes model selection impossible.
    +
    +Spark allows users to set the `coldStartStrategy` parameter
    +to `drop` in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values. 
    +The evaluation metric will then be computed over the non-`NaN` data and will be valid. 
    +Usage of this parameter is illustrated in the example below.
    +
    +**Note:** currently the supported cold start strategies are `nan` (the default behavior mentioned 
    --- End diff --
    
    A bit wary of putting the options explicitly here, but it seems hard to avoid since they're mentioned above. Even so, maybe use "drop" and "nan" (quotes).


[GitHub] spark issue #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" u...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17102
  
    **[Test build #73703 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73703/testReport)** for PR 17102 at commit [`c422d58`](https://github.com/apache/spark/commit/c422d5892fe3c8ed2fd8c4f3bf4978b9ced2bb02).


[GitHub] spark pull request #17102: [SPARK-19345][ML][DOC] Add doc for "coldStartStra...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17102#discussion_r103733577
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -59,6 +59,34 @@ This approach is named "ALS-WR" and discussed in the paper
     It makes `regParam` less dependent on the scale of the dataset, so we can apply the
     best parameter learned from a sampled subset to the full dataset and expect similar performance.
     
    +### Cold-start strategy
    +
    +When making predictions using an `ALSModel`, it is common to encounter users and/or items in the 
    +test dataset that were not present during training the model. This typically occurs in two 
    +scenarios:
    +
    +1. In production, for new users or items that have no rating history and on which the model has not 
    +been trained (this is the "cold start problem")
    --- End diff --
    
    nit: add punctuation (other places in the user guide have punctuation despite the fact that we are listing things)

