You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2016/05/04 10:33:16 UTC

[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/12896

    [SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy

    This PR adds a param to `ALS`/`ALSModel` to set the strategy used when encountering unknown users or items at prediction time in `transform`. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation.
    
    The current behavior returns `NaN` if a user/item is unknown. In scenario (b), this can easily occur when using `CrossValidator` or `TrainValidationSplit` since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns `NaN` for all metrics, making model selection impossible.
    
    The new param, `unknownStrategy`, defaults to `nan` (the current behavior). The other option supported initially is `drop`, which drops all rows with `NaN` predictions. This flag allows users to use `ALS` in cross-validation settings. It is made an `expertParam`. The param is made a string so that the set of strategies can be extended in future (some options are discussed in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)).
    
    ## How was this patch tested?
    
    New unit tests, and manual "before and after" tests for Scala & Python using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator` or `TrainValidationSplit` with the default param setting results in metrics that are all `NaN`, while setting `unknownStrategy` to `drop` results in valid metrics. 
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark SPARK-14489-als-nan

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12896.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12896
    
----
commit 69a7ab31e91026bccc8f06eff9c98b452f09b5f2
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-05-04T08:19:21Z

    Scala param and tests

commit 82ccb2136a0052ac97b2560c850f221bd19073dd
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-05-04T09:51:59Z

    Python param and tests

commit 933526defd27256cfd3b59fac28276a50262d1bf
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-05-04T10:16:40Z

    Improve test a bit

commit 3287303f627a312b7fccf6e7d47151938df21966
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-05-04T10:22:03Z

    Doc tweak

commit fc437451a598221f0878b7a2e0b87d17572019cc
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-05-04T10:23:33Z

    Doc remove space

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62070510
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    +   * - "nan": prediction value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    +   * Default: "nan".
    +   * @group expertParam
    +   */
    +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    --- End diff --
    
    I personally don't see much difference between "new" and "unknown". Will
    think about it some more
    
    On Wed, 4 May 2016 at 18:08, Seth Hendrickson <no...@github.com>
    wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62067123>:
    >
    > > @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    > >
    > >    /** @group getParam */
    > >    def getItemCol: String = $(itemCol)
    > > +
    > > +  /**
    > > +   * Param for strategy for dealing with unknown users or items at prediction time.
    > > +   * Supported values:
    > > +   * - "nan": prediction value for unknown ids will be NaN.
    > > +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    > > +   * Default: "nan".
    > > +   * @group expertParam
    > > +   */
    > > +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    >
    > Thinking about it more, using "unknown" just seems confusing.
    > "newIdPredictionStrategy" ? I don't have a great idea for it.
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62067123>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62086416
  
    --- Diff: python/pyspark/ml/recommendation.py ---
    @@ -127,27 +127,31 @@ class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, Ha
                                   "StorageLevel for ALS model factors. " +
                                   "Default: 'MEMORY_AND_DISK'.",
                                   typeConverter=TypeConverters.toString)
    +    unknownStrategy = Param(Params._dummy(), "unknownStrategy",
    +                            "strategy for dealing with unknown users or items at prediction " +
    +                            "time. Supported values: 'nan', 'drop'.",
    --- End diff --
    
    Mention the default value here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by dzhus <gi...@git.apache.org>.
Github user dzhus commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Excuse me for barging in, but speaking of compatibility and ALS cross-validation, am I correct that RegressionEvaluator is currently unaware of special logic required for RMSE to be at least remotely helpful with implicit ALS, as discussed in apache/spark#597? I am getting RMSE of well over 1 using `RegressionEvaluator` on implicit ALS training results.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #73164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73164/testReport)** for PR 12896 at commit [`9981fd8`](https://github.com/apache/spark/commit/9981fd800fe9c8b687e79f536cbf691c75fec640).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62907308
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    --- End diff --
    
    ok, will update


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62067878
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    --- End diff --
    
    Should we add a note here about this being useful/necessary when performing cross validation or evaluating the model on unseen data? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62096297
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -268,6 +294,10 @@ class ALSModel private[ml] (
     @Since("1.6.0")
     object ALSModel extends MLReadable[ALSModel] {
     
    +  private val NaN = "nan"
    +  private val Drop = "drop"
    +  private[recommendation] final val supportedUnknownStrategies = Array(NaN, Drop)
    --- End diff --
    
    Instead of adding an array here, could you just create it in the `ParamValidator` when defining the `Param`?
    ```scala
    ParamValidators.inArray(Array(ALSModel.NaN, ALSModel.Drop))
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62598/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    cc also @xsankar who was interested on the related JIRAs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62066768
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    I see a similar test was added for the storage level params, but I don't exactly understand why we need these. We don't implement these types of tests for other classes and params and this is just checking the getters and setters. Also, the `fit` call seems unnecessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-216824282
  
    cc @sethah @holdenk @srowen @jkbradley


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #62599 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62599/consoleFull)** for PR 12896 at commit [`6599372`](https://github.com/apache/spark/commit/6599372bf67e372bdc0d2e1f63c57d7ca74f2c60).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62059332
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    +   * - "nan": prediction value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    +   * Default: "nan".
    +   * @group expertParam
    +   */
    +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    --- End diff --
    
    I don't like the name "unknownStrategy" - seems very ambiguous. I'm not sure what is best, but for example I think "unknownPredictionStrategy" or "newUserItemPredictionStrategy" are better. Just something that will be obvious what it is by the name.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #71592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71592/testReport)** for PR 12896 at commit [`a439899`](https://github.com/apache/spark/commit/a4398995fbd3180f04ef1837113ce88a8703a6ee).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62889428
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    --- End diff --
    
    Any reason why "nan" rather than "NaN"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62070073
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -241,11 +261,17 @@ class ALSModel private[ml] (
             Float.NaN
           }
         }
    -    dataset
    +    val predictions = dataset
           .join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
           .join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
           .select(dataset("*"),
             predict(userFactors("features"), itemFactors("features")).as($(predictionCol)))
    +    $(unknownStrategy) match {
    +      case ALSModel.Drop =>
    --- End diff --
    
    Sure, I just think it's easier to maintain with changes to the options
    occurring in one place, just for avoiding duplication. But it doesn't make
    a big difference
    On Wed, 4 May 2016 at 18:17, Seth Hendrickson <no...@github.com>
    wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62068526>:
    >
    > >        .join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
    > >        .join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
    > >        .select(dataset("*"),
    > >          predict(userFactors("features"), itemFactors("features")).as($(predictionCol)))
    > > +    $(unknownStrategy) match {
    > > +      case ALSModel.Drop =>
    >
    > For my own curiosity, is there a reason we can't just match on the string
    > value and avoid creating these new vals? Seems like I have seen both ways
    > in the code.
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62068526>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71592/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-217627900
  
    **[Test build #58064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58064/consoleFull)** for PR 12896 at commit [`b71dcac`](https://github.com/apache/spark/commit/b71dcaca39d5f10b484932eaf25e290c7796d93b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #63535 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63535/consoleFull)** for PR 12896 at commit [`9ff2a0a`](https://github.com/apache/spark/commit/9ff2a0a1d2e46fda6f794ce1dd37fd852b4ddc06).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-217629596
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58064/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62890352
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    --- End diff --
    
    Nope - easier to type? :) also seems consistent with most other string
    params which seem to be lower case (eg "rmse", "map" etc).
    On Wed, 11 May 2016 at 19:24, Holden Karau <no...@github.com> wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62889428>:
    >
    > > @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    > >
    > >    /** @group getParam */
    > >    def getItemCol: String = $(itemCol)
    > > +
    > > +  /**
    > > +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    > > +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    > > +   * the model has not seen in the training data.
    > > +   * Supported values:
    > > +   * - "nan": predicted value for unknown ids will be NaN.
    >
    > Any reason why "nan" rather than "NaN"?
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/b71dcaca39d5f10b484932eaf25e290c7796d93b#r62889428>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62163670
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    @BryanCutler yeah, it would fail when `fit` is called - on the JVM side actually as that is where the validation occurs. It's due to the wrapper as you say, not much we can do on the Python side for now.
    
    Since we test invalid params in the Scala suite, I don't think it's necessary to test them again here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-216831469
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-216831388
  
    **[Test build #57753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57753/consoleFull)** for PR 12896 at commit [`fc43745`](https://github.com/apache/spark/commit/fc437451a598221f0878b7a2e0b87d17572019cc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62888342
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    --- End diff --
    
    Fair enough. Will reword this a bit.
    On Wed, 11 May 2016 at 17:25, Seth Hendrickson <no...@github.com>
    wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62869002>:
    >
    > > @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    > >
    > >    /** @group getParam */
    > >    def getItemCol: String = $(itemCol)
    > > +
    > > +  /**
    > > +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    > > +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    > > +   * the model has not seen in the training data.
    > > +   * Supported values:
    > > +   * - "nan": predicted value for unknown ids will be NaN.
    > > +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    >
    > "will be dropped." -> "will not be included in the recommendations."
    >
    > It could sound like we are modifying the input dataframe, but we are not.
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/b71dcaca39d5f10b484932eaf25e290c7796d93b#r62869002>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62095795
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    +   * - "nan": prediction value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    +   * Default: "nan".
    +   * @group expertParam
    +   */
    +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    --- End diff --
    
    I think "unknownIdPredictionStrategy" or just "unknownIdStrategy" sound fine - they are unknown to the trained model right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Another option is to make `predictionCol` nullable and return `null` predictions. The `drop` strategy can still apply (though it will need to be a custom `filter` rather than `df.na.drop`), but it makes it totally clear when a prediction is "missing" vs `NaN`. 
    
    However, is it even possible to get a bunch of `NaN`s, e.g. if the model somehow diverged (I don't think that's even possible with ALS?). So, this may just add needless complexity.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-217627870
  
    I think I may create a new example for ALS cross-validation (or port the Movielens example from mllib even though ml won't yet support top-k recs) to illustrate the usage of this param.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-217627762
  
    Changed param name to `coldStartStrategy`, added a little more doc to the param, and removed the Python tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-216831471
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57753/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62198643
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    +   * - "nan": prediction value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    +   * Default: "nan".
    +   * @group expertParam
    +   */
    +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    --- End diff --
    
    @MLnick I think it's good. It's an "expert" param anyway. Also, it's better in a way because it's hard to _assume_ what it means. For those that don't know it, they can check out the doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62067123
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    +   * - "nan": prediction value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    +   * Default: "nan".
    +   * @group expertParam
    +   */
    +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    --- End diff --
    
    Thinking about it more, using "unknown" just seems confusing. "newIdPredictionStrategy" ? I don't have a great idea for it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62890571
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    --- End diff --
    
    Actually did think about making it case insensitive - I've seen that in
    some DT params. But figured it's probably not worth it.
    
    On Wed, 11 May 2016 at 19:24, Holden Karau <no...@github.com> wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62889428>:
    >
    > > @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    > >
    > >    /** @group getParam */
    > >    def getItemCol: String = $(itemCol)
    > > +
    > > +  /**
    > > +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    > > +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    > > +   * the model has not seen in the training data.
    > > +   * Supported values:
    > > +   * - "nan": predicted value for unknown ids will be NaN.
    >
    > Any reason why "nan" rather than "NaN"?
    >
    > —
    > You are receiving this because you were mentioned.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/b71dcaca39d5f10b484932eaf25e290c7796d93b#r62889428>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69401/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Yes that sounds like the direction this is heading. In a way, "drop" and "nan" aren't really different cold start strategies. Both return an implicit or explicit "I don't know; ask something else". One of these wouldn't have a counterpart in an API that offers point predictions. 
    
    "drop" and "nan" have distinct use in the context of evaluation, and it sounds like we'd even get rid of "nan" except for potential backwards compatibility. But, maybe that's just fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62070675
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -486,6 +486,39 @@ class ALSSuite
         assert(getFactors(model.userFactors) === getFactors(model2.userFactors))
         assert(getFactors(model.itemFactors) === getFactors(model2.itemFactors))
       }
    +
    +  test("ALS unknown user/item prediction strategy") {
    +    val sqlContext = this.sqlContext
    +    import sqlContext.implicits._
    +    import org.apache.spark.sql.functions._
    +
    +    val (ratings, _) = genExplicitTestData(numUsers = 4, numItems = 4, rank = 1)
    +    val data = ratings.toDF
    +    val knownUser = data.select(max("user")).as[Int].first()
    +    val unknownUser = knownUser + 10
    +    val knownItem = data.select(max("item")).as[Int].first()
    +    val unknownItem = knownItem + 20
    +    val test = Seq(
    +      (unknownUser, unknownItem),
    +      (knownUser, unknownItem),
    +      (unknownUser, knownItem),
    +      (knownUser, knownItem)
    +    ).toDF("user", "item")
    +
    +    val als = new ALS().setMaxIter(1).setRank(1)
    +    // default is 'nan'
    +    val defaultModel = als.fit(data)
    +    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    +    assert(defaultPredictions.length == 4)
    +    defaultPredictions.slice(0, 3).foreach(p => assert(p.isNaN))
    +    assert(!defaultPredictions.last.isNaN)
    +
    +    // check 'drop' strategy should filter out rows with unknown users/items
    +    val dropModel = als.setUnknownStrategy("drop").fit(data)
    --- End diff --
    
    True the fit is not strictly necessary.
    
    On Wed, 4 May 2016 at 17:52, Seth Hendrickson <no...@github.com>
    wrote:
    
    > In mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62064543>:
    >
    > > +      (unknownUser, unknownItem),
    > > +      (knownUser, unknownItem),
    > > +      (unknownUser, knownItem),
    > > +      (knownUser, knownItem)
    > > +    ).toDF("user", "item")
    > > +
    > > +    val als = new ALS().setMaxIter(1).setRank(1)
    > > +    // default is 'nan'
    > > +    val defaultModel = als.fit(data)
    > > +    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    > > +    assert(defaultPredictions.length == 4)
    > > +    defaultPredictions.slice(0, 3).foreach(p => assert(p.isNaN))
    > > +    assert(!defaultPredictions.last.isNaN)
    > > +
    > > +    // check 'drop' strategy should filter out rows with unknown users/items
    > > +    val dropModel = als.setUnknownStrategy("drop").fit(data)
    >
    > I don't see a reason to call fit a second time since the value only has
    > effect in the transformer. This only checks that the param get transferred
    > to the transformer properly, right?
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62064543>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #71838 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71838/testReport)** for PR 12896 at commit [`9981fd8`](https://github.com/apache/spark/commit/9981fd800fe9c8b687e79f536cbf691c75fec640).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-217629595
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73521/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62085392
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    Yup, fair point. Me being overly conservative, I'll remove these.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62068526
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -241,11 +261,17 @@ class ALSModel private[ml] (
             Float.NaN
           }
         }
    -    dataset
    +    val predictions = dataset
           .join(userFactors, dataset($(userCol)) === userFactors("id"), "left")
           .join(itemFactors, dataset($(itemCol)) === itemFactors("id"), "left")
           .select(dataset("*"),
             predict(userFactors("features"), itemFactors("features")).as($(predictionCol)))
    +    $(unknownStrategy) match {
    +      case ALSModel.Drop =>
    --- End diff --
    
    For my own curiosity, is there a reason we can't just match on the string value and avoid creating these new vals? Seems like I have seen both ways in the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-217629581
  
    **[Test build #58064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58064/consoleFull)** for PR 12896 at commit [`b71dcac`](https://github.com/apache/spark/commit/b71dcaca39d5f10b484932eaf25e290c7796d93b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-216824788
  
    **[Test build #57753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57753/consoleFull)** for PR 12896 at commit [`fc43745`](https://github.com/apache/spark/commit/fc437451a598221f0878b7a2e0b87d17572019cc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62098570
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    I agree with @sethah that testing the setting/transferring of the Params here is not needed and would be better to leave to the wrapper class to test.  Also, `_java_obj` is supposed to be a "private" variable and maybe not good for the end user to think it's ok to use directly.
    
    Perhaps a more useful test could be what happens if the user sets the Param to an unsupported value, e.g.
    ```py
    als.setUnknownStrategy("duh")
    ```
    Is the expected behaviour that it would fail when `fit` is called?  If so, I don't think that's great, but that's more of a problem of the wrapper class anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/12896#issuecomment-218555279
  
    @srowen @jkbradley - this is technically not a "bug", but it impacts usability of `CrossValidator` / `TrainValidationSplit` for users of `ALS`. As such I'd like to have it in 2.0, if you have time to take a quick pass. I don't feel super strongly about that if you prefer to leave it for 2.1 though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    OK. I can see that this is a bit of a band-aid, but might still be implemented as part of the beginnings of a more general 'cold start' configuration. As long as the API itself makes sense in the long term, and it seems so.
    
    "drop" really drops any prediction that's NaN, not necessarily only for unknown users and items. I suppose the assumption is that those are the same cases, which is probably true in practice now.
    
    What would it mean if a method was later added to make point predictions? you'd get nothing back, or maybe this would be considered synonymous with NaN in this case and get NaN back.
    
    It's a little stretch to implement the required behavior here as a cold-start strategy but not terrible. If you're OK with it go for it. I think there are a few more comments to resolve about the `@since` tags 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12896


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    As for leaving the current behavior - it's more for compatibility. Though as I mention above I believe returning `NaN` is reasonable for live scoring purposes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71838/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62165258
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    +   * - "nan": prediction value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    +   * Default: "nan".
    +   * @group expertParam
    +   */
    +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    --- End diff --
    
    How about "coldStartStrategy"? For people familiar with recommendations "cold start" is clear in meaning. But it may be less clear for people less familiar.
    
    @srowen what do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62224489
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    yeah, that's true


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62905493
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    --- End diff --
    
    +1 for case insensitivity


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62086318
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    I think its probably good habit to add tests for new complex params we add (but probably not worth it to go back and fill it in). I know that as a user when the docs don't explain stuff closely enough turning to the tests is often a good fall back (so we can think of it as also helping with documentation in a way).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    @srowen @jkbradley any further comments / issues? I plan to create follow-up JIRAs for docs & examples, as well as expansions to the cold-start strategies.
    
    also cc @mengxr @yanboliang for any comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #71838 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71838/testReport)** for PR 12896 at commit [`9981fd8`](https://github.com/apache/spark/commit/9981fd800fe9c8b687e79f536cbf691c75fec640).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item ...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r72931611
  
    --- Diff: python/pyspark/ml/recommendation.py ---
    @@ -332,6 +338,20 @@ def getFinalStorageLevel(self):
             """
             return self.getOrDefault(self.finalStorageLevel)
     
    +    @since("2.0.0")
    --- End diff --
    
    Ah yeah I totally forgot to update these. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62064335
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    +   * - "nan": prediction value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    +   * Default: "nan".
    +   * @group expertParam
    +   */
    +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    --- End diff --
    
    Fair enough - I was trying to keep it brief. But it could be more clear.
    How about "unknownIdPredictionStrategy"?
    On Wed, 4 May 2016 at 17:23, Seth Hendrickson <no...@github.com>
    wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62059332>:
    >
    > > @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    > >
    > >    /** @group getParam */
    > >    def getItemCol: String = $(itemCol)
    > > +
    > > +  /**
    > > +   * Param for strategy for dealing with unknown users or items at prediction time.
    > > +   * Supported values:
    > > +   * - "nan": prediction value for unknown ids will be NaN.
    > > +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    > > +   * Default: "nan".
    > > +   * @group expertParam
    > > +   */
    > > +  val unknownStrategy = new Param[String](this, "unknownStrategy",
    >
    > I don't like the name "unknownStrategy" - seems very ambiguous. I'm not
    > sure what is best, but for example I think "unknownPredictionStrategy" or
    > "newUserItemPredictionStrategy" are better. Just something that will be
    > obvious what it is by the name.
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62059332>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62091677
  
    --- Diff: python/pyspark/ml/recommendation.py ---
    @@ -127,27 +127,31 @@ class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, Ha
                                   "StorageLevel for ALS model factors. " +
                                   "Default: 'MEMORY_AND_DISK'.",
                                   typeConverter=TypeConverters.toString)
    +    unknownStrategy = Param(Params._dummy(), "unknownStrategy",
    +                            "strategy for dealing with unknown users or items at prediction " +
    +                            "time. Supported values: 'nan', 'drop'.",
    --- End diff --
    
    This is an interesting one, as I've been doing that (for the storage params), but actually when you do `explainParams` it gets duplicated so looks weird. So I think it may be best left out?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item ...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r72916851
  
    --- Diff: python/pyspark/ml/recommendation.py ---
    @@ -332,6 +338,20 @@ def getFinalStorageLevel(self):
             """
             return self.getOrDefault(self.finalStorageLevel)
     
    +    @since("2.0.0")
    --- End diff --
    
    Super minor - but we probably want to bump the since annotations here (looks like you already got the Scala ones).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #62599 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62599/consoleFull)** for PR 12896 at commit [`6599372`](https://github.com/apache/spark/commit/6599372bf67e372bdc0d2e1f63c57d7ca74f2c60).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    I think there is a fair bit of difference between cross-validating the model and scoring in production.
    
    In most practical live scoring situations, there may be multiple levels of fallbacks / defaults for the cold-start case (including e.g. "most popular", "newest", "content-based methods" etc etc). There may also be various post-processing steps applied to the results. I don't think it's feasible to re-create live behaviour perfectly for cross-validation scenarios (especially as these systems are often totally different from Spark).
    
    Even for offline bulk scoring, again there may be many different options for cold start. Do we intend to support all of them within Spark? Again I don't think that's feasible, though as discussed on the JIRA we can certainly support a few useful options, such as "average user" which could indeed serve for both CV and live scoring purposes.
    
    I actually think `NaN` for live scoring is "better" than say `0`, because then it is very clear that it's a missing data point (which the system can choose how to handle) rather than a prediction of `0`.
    
    For CV, I'd expect that predicting `0` would have a dramatic negative impact on RMSE. So for CV I'd say the `drop` option is more reasonable.
    
    This is not arguing against other reasonable options (average rating, average user vectors and so on) - we can add those later on user demand. This is just a start. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Your suggestion is, to me, the ideal solution. It's probably the more common method of splitting "ratings" datasets for CV purposes.
    
    I'm interested in working on it but I think it would be a whole new specific cross-validator class. I'm not quite sure what the best approach is for efficiency (refer #14321 for stratified sampling approach, it's more for labels and is not efficient for this case, but the general concept might apply). In short, it's obviously a lot more effort and will take time. Perhaps it also starts life outside of Spark in packages. Not sure on this yet, but happy to collaborate on ideas!
    
    Originally this PR was intended for `2.0` to at least make ALS useable with the CV classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #73521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73521/testReport)** for PR 12896 at commit [`041a3b3`](https://github.com/apache/spark/commit/041a3b3fa3c93f129d4eeb9846b5e2dba769ca95).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #69401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69401/consoleFull)** for PR 12896 at commit [`a439899`](https://github.com/apache/spark/commit/a4398995fbd3180f04ef1837113ce88a8703a6ee).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #73164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73164/testReport)** for PR 12896 at commit [`9981fd8`](https://github.com/apache/spark/commit/9981fd800fe9c8b687e79f536cbf691c75fec640).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Hm, let me go through the logic with you one more time here. Isn't it better in theory to fix the model to not return NaN, but rather return _some_ default answer, even if it's "0" or equivalent? this is at least no worse for scoring, and fixes the evaluation problem. New users and items are reasonable conditions for this model, not an error case.
    
    The current behavior isn't that helpful, so I'm not sure leaving it as a choice is doing anybody a favor. My concern with the "drop" mode is that it is not penalizing any case where the model can't make an answer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62064543
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -486,6 +486,39 @@ class ALSSuite
         assert(getFactors(model.userFactors) === getFactors(model2.userFactors))
         assert(getFactors(model.itemFactors) === getFactors(model2.itemFactors))
       }
    +
    +  test("ALS unknown user/item prediction strategy") {
    +    val sqlContext = this.sqlContext
    +    import sqlContext.implicits._
    +    import org.apache.spark.sql.functions._
    +
    +    val (ratings, _) = genExplicitTestData(numUsers = 4, numItems = 4, rank = 1)
    +    val data = ratings.toDF
    +    val knownUser = data.select(max("user")).as[Int].first()
    +    val unknownUser = knownUser + 10
    +    val knownItem = data.select(max("item")).as[Int].first()
    +    val unknownItem = knownItem + 20
    +    val test = Seq(
    +      (unknownUser, unknownItem),
    +      (knownUser, unknownItem),
    +      (unknownUser, knownItem),
    +      (knownUser, knownItem)
    +    ).toDF("user", "item")
    +
    +    val als = new ALS().setMaxIter(1).setRank(1)
    +    // default is 'nan'
    +    val defaultModel = als.fit(data)
    +    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    +    assert(defaultPredictions.length == 4)
    +    defaultPredictions.slice(0, 3).foreach(p => assert(p.isNaN))
    +    assert(!defaultPredictions.last.isNaN)
    +
    +    // check 'drop' strategy should filter out rows with unknown users/items
    +    val dropModel = als.setUnknownStrategy("drop").fit(data)
    --- End diff --
    
    I don't see a reason to call `fit` a second time since the value only has effect in the transformer. This only checks that the param get transferred to the transformer properly, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #62598 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62598/consoleFull)** for PR 12896 at commit [`0d0278a`](https://github.com/apache/spark/commit/0d0278a9d8d3996929ef5fce8c815c87b81f9536).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    This looks good to me pending since annotation bump in Python. Personally I think a docs update could be a good follow up - but since were tagging it as an expert param for now I think this makes sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62599/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    LGTM pending since tags. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #68925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68925/consoleFull)** for PR 12896 at commit [`a439899`](https://github.com/apache/spark/commit/a4398995fbd3180f04ef1837113ce88a8703a6ee).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68925/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62869002
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    +   * - "drop": rows in the input DataFrame containing unknown ids will be dropped.
    --- End diff --
    
    "will be dropped." -> "will not be included in the recommendations."
    
    It could sound like we are modifying the input dataframe, but we are not. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62163779
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -268,6 +294,10 @@ class ALSModel private[ml] (
     @Since("1.6.0")
     object ALSModel extends MLReadable[ALSModel] {
     
    +  private val NaN = "nan"
    +  private val Drop = "drop"
    +  private[recommendation] final val supportedUnknownStrategies = Array(NaN, Drop)
    --- End diff --
    
    It would be ok, but I use `supportedUnknownStrategies` in the param comment for "supported values" too, so I think this might be best.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73164/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #71592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71592/testReport)** for PR 12896 at commit [`a439899`](https://github.com/apache/spark/commit/a4398995fbd3180f04ef1837113ce88a8703a6ee).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62096512
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -486,6 +486,39 @@ class ALSSuite
         assert(getFactors(model.userFactors) === getFactors(model2.userFactors))
         assert(getFactors(model.itemFactors) === getFactors(model2.itemFactors))
       }
    +
    +  test("ALS unknown user/item prediction strategy") {
    +    val sqlContext = this.sqlContext
    +    import sqlContext.implicits._
    +    import org.apache.spark.sql.functions._
    +
    +    val (ratings, _) = genExplicitTestData(numUsers = 4, numItems = 4, rank = 1)
    +    val data = ratings.toDF
    +    val knownUser = data.select(max("user")).as[Int].first()
    +    val unknownUser = knownUser + 10
    +    val knownItem = data.select(max("item")).as[Int].first()
    +    val unknownItem = knownItem + 20
    +    val test = Seq(
    +      (unknownUser, unknownItem),
    +      (knownUser, unknownItem),
    +      (unknownUser, knownItem),
    +      (knownUser, knownItem)
    +    ).toDF("user", "item")
    +
    +    val als = new ALS().setMaxIter(1).setRank(1)
    +    // default is 'nan'
    +    val defaultModel = als.fit(data)
    +    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    +    assert(defaultPredictions.length == 4)
    +    defaultPredictions.slice(0, 3).foreach(p => assert(p.isNaN))
    --- End diff --
    
    no big deal, but slightly cleaner 
    ```scala
    defaultPredictions.slice(0, 3).foreach(_.isNaN)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62070302
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown users or items at prediction time.
    +   * Supported values:
    --- End diff --
    
    I'm thinking I will create an example, or perhaps expand alsexample to
    include cross validation, illustrating this.
    
    It may make sense to add something in the comment.
    
    On Wed, 4 May 2016 at 18:11, Seth Hendrickson <no...@github.com>
    wrote:
    
    > In mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
    > <https://github.com/apache/spark/pull/12896#discussion_r62067878>:
    >
    > > @@ -71,6 +71,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
    > >
    > >    /** @group getParam */
    > >    def getItemCol: String = $(itemCol)
    > > +
    > > +  /**
    > > +   * Param for strategy for dealing with unknown users or items at prediction time.
    > > +   * Supported values:
    >
    > Should we add a note here about this being useful/necessary when
    > performing cross validation or evaluating the model on unseen data?
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62067878>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Sean, let's assume we ultimately have the following available:
    * `drop` - drop `NaN`
    * `average` - return the average rating for new users/items
    * `average_factor` - predict using the average factor vector for new users/items
    
    Would you suggest that this a "good" end result? Or would you suggest to not have the `drop` option available in this list? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62900751
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -71,6 +71,26 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
     
       /** @group getParam */
       def getItemCol: String = $(itemCol)
    +
    +  /**
    +   * Param for strategy for dealing with unknown or new users/items at prediction time.
    +   * This may be useful in cross-validation or production scenarios, for handling user/item ids
    +   * the model has not seen in the training data.
    +   * Supported values:
    +   * - "nan": predicted value for unknown ids will be NaN.
    --- End diff --
    
    It _might_ be worth it to make it cae insensitive since NaN is more commonly typed as NaN (and the other abbreviations used for algorithms are more commonly lower case).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62071023
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    The params on the Java side are only set when fit is called afaik.
    
    But yeah the tests are not strictly necessary. I could add a doc string
    test for it. My only concern (as with the storage params) is I didn't want
    to highlight what is a bit more of an expertparam
    
    On Wed, 4 May 2016 at 18:05, Seth Hendrickson <no...@github.com>
    wrote:
    
    > In python/pyspark/ml/tests.py
    > <https://github.com/apache/spark/pull/12896#discussion_r62066768>:
    >
    > > @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
    > >          self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
    > >          self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
    > >
    > > +    def test_unknown_strategy(self):
    >
    > I see a similar test was added for the storage level params, but I don't
    > exactly understand why we need these. We don't implement these types of
    > tests for other classes and params and this is just checking the getters
    > and setters. Also, the fit call seems unnecessary.
    >
    > —
    > You are receiving this because you authored the thread.
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/12896/files/fc437451a598221f0878b7a2e0b87d17572019cc#r62066768>
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62227681
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -486,6 +486,39 @@ class ALSSuite
         assert(getFactors(model.userFactors) === getFactors(model2.userFactors))
         assert(getFactors(model.itemFactors) === getFactors(model2.itemFactors))
       }
    +
    +  test("ALS unknown user/item prediction strategy") {
    +    val sqlContext = this.sqlContext
    +    import sqlContext.implicits._
    +    import org.apache.spark.sql.functions._
    +
    +    val (ratings, _) = genExplicitTestData(numUsers = 4, numItems = 4, rank = 1)
    +    val data = ratings.toDF
    +    val knownUser = data.select(max("user")).as[Int].first()
    +    val unknownUser = knownUser + 10
    +    val knownItem = data.select(max("item")).as[Int].first()
    +    val unknownItem = knownItem + 20
    +    val test = Seq(
    +      (unknownUser, unknownItem),
    +      (knownUser, unknownItem),
    +      (unknownUser, knownItem),
    +      (knownUser, knownItem)
    +    ).toDF("user", "item")
    +
    +    val als = new ALS().setMaxIter(1).setRank(1)
    +    // default is 'nan'
    +    val defaultModel = als.fit(data)
    +    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    +    assert(defaultPredictions.length == 4)
    +    defaultPredictions.slice(0, 3).foreach(p => assert(p.isNaN))
    --- End diff --
    
    ooops, forgot about the assert - the way you have it is good!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63535/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    For scoring, returning "I don't know" could be reasonable in order to let some other logic take over. I could see that. It would be important to do that, but maybe a separable concern. For evaluation, hm, if the goal is to ignore cases where the result will reasonably be "I don't know" then what about constructing the evaluation differently, so that the test set doesn't have users that the train set doesn't for example? that's more sound, because then any 'bad' NaNs still show up. Is it that this is hard to implement in the current structure?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    ping @sethah @holdenk @srowen @jkbradley 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #73521 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73521/testReport)** for PR 12896 at commit [`041a3b3`](https://github.com/apache/spark/commit/041a3b3fa3c93f129d4eeb9846b5e2dba769ca95).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    RMSE is just RMSE, regardless of what data it's applied to. You're right that RMSE should be <= 1 if all input/outputs are in [0,1], but that's not actually the case in the implicit model, where predictions  can actually be outside this range. Really, RMSE is not the right metric for implicit results to begin with. You would want ranking metrics for this case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #63535 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63535/consoleFull)** for PR 12896 at commit [`9ff2a0a`](https://github.com/apache/spark/commit/9ff2a0a1d2e46fda6f794ce1dd37fd852b4ddc06).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Reviving after a hiatus. Updated since tags. I've actually recently come across a number of users hitting this issue in production and are unable to use ALS with cross-validation as a result.
    
    Added [SPARK-19345](https://issues.apache.org/jira/browse/SPARK-19345) for adding detail to the docs / examples and [SPARK-19346](https://issues.apache.org/jira/browse/SPARK-19346) for adding the advanced strategies discussed above.
    
    ping @sethah @srowen @jkbradley in case you have any more comment. I'll merge in a day or two if no major comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62164013
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -486,6 +486,39 @@ class ALSSuite
         assert(getFactors(model.userFactors) === getFactors(model2.userFactors))
         assert(getFactors(model.itemFactors) === getFactors(model2.itemFactors))
       }
    +
    +  test("ALS unknown user/item prediction strategy") {
    +    val sqlContext = this.sqlContext
    +    import sqlContext.implicits._
    +    import org.apache.spark.sql.functions._
    +
    +    val (ratings, _) = genExplicitTestData(numUsers = 4, numItems = 4, rank = 1)
    +    val data = ratings.toDF
    +    val knownUser = data.select(max("user")).as[Int].first()
    +    val unknownUser = knownUser + 10
    +    val knownItem = data.select(max("item")).as[Int].first()
    +    val unknownItem = knownItem + 20
    +    val test = Seq(
    +      (unknownUser, unknownItem),
    +      (knownUser, unknownItem),
    +      (unknownUser, knownItem),
    +      (knownUser, knownItem)
    +    ).toDF("user", "item")
    +
    +    val als = new ALS().setMaxIter(1).setRank(1)
    +    // default is 'nan'
    +    val defaultModel = als.fit(data)
    +    val defaultPredictions = defaultModel.transform(test).select("prediction").as[Float].collect()
    +    assert(defaultPredictions.length == 4)
    +    defaultPredictions.slice(0, 3).foreach(p => assert(p.isNaN))
    --- End diff --
    
    it doesn't actually work with the `assert`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #62598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62598/consoleFull)** for PR 12896 at commit [`0d0278a`](https://github.com/apache/spark/commit/0d0278a9d8d3996929ef5fce8c815c87b81f9536).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14489][ML][PYSPARK] ALS unknown user/it...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12896#discussion_r62083620
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -1027,6 +1027,21 @@ def test_storage_levels(self):
             self.assertEqual(als.getFinalStorageLevel(), "DISK_ONLY")
             self.assertEqual(als._java_obj.getFinalStorageLevel(), "DISK_ONLY")
     
    +    def test_unknown_strategy(self):
    --- End diff --
    
    I don't think we need to test this at all. We already have tests to check the param setters and getters and transfer to java, and that's really all this does. I don't know why we would need to check this, but _not_ check the many other params ALS has.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #12896: [SPARK-14489][ML][PYSPARK] ALS unknown user/item predict...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12896
  
    **[Test build #69401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69401/consoleFull)** for PR 12896 at commit [`a439899`](https://github.com/apache/spark/commit/a4398995fbd3180f04ef1837113ce88a8703a6ee).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org