You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by sueann <gi...@git.apache.org> on 2017/02/28 00:28:48 UTC

[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

GitHub user sueann opened a pull request:

    https://github.com/apache/spark/pull/17090

    [Spark-19535][ML] RecommendForAllUsers RecommendForAllItems for ALS on Dataframe 

    ## What changes were proposed in this pull request?
    
    This is a simple implementation of RecommendForAllUsers & RecommendForAllItems for the Dataframe version of ALS. It uses Dataframe operations (not a wrapper on the RDD implementation). Haven't benchmarked against a wrapper, but unit test examples do work.
    
    ## How was this patch tested?
    
    Unit tests
    ```
    $ build/sbt
    > mllib/testOnly *ALSSuite -- -z "recommendFor"
    > mllib/testOnly
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sueann/spark SPARK-19535

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17090.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17090
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    LGTM
    Any other comments before we merge?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Fitting into the CV / evaluator is actually fairly straightforward. It's just that the semantics of `transform` for top-k recommendation must fit into whatever we decide on for `RankingEvaluator`, so they are closely linked. (In other words, they must be compatible). Once the semantics (basically output schema for `transform`) are decided it's quite simple.
    
    It was discussed on the JIRA [here](https://issues.apache.org/jira/browse/SPARK-13857?focusedCommentId=15236796&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15236796) and [here](https://issues.apache.org/jira/browse/SPARK-13857?focusedCommentId=15822021&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15822021).
    
    I haven't had a chance to refine it yet but I have a view on the best approach now (basically to fit in with the design of [SPARK-14409](https://issues.apache.org/jira/browse/SPARK-14409) and in particular the basic version of #16618). I think that design / schema is more "DataFrame-like".
    
    In any event - I'm not against having the convenience methods for recommend-all here. I support it. Ultimately the `transform` approach is mostly for fitting into Pipelines & cross-validation. `transform` could call into these convenience methods (though it will need a DataFrame-based input version).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73628/testReport)** for PR 17090 at commit [`ef93575`](https://github.com/apache/spark/commit/ef93575ff30005fc9bab29dc5d55809e5bd773f4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    the same as  https://github.com/apache/spark/pull/12574 ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103349139
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/TopByKeyAggregator.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import scala.language.implicitConversions
    +import scala.reflect.runtime.universe.TypeTag
    +
    +import org.apache.spark.sql.{Encoder, Encoders}
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.expressions.Aggregator
    +import org.apache.spark.util.BoundedPriorityQueue
    +
    +/**
    + * Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
    + * the top `num` K2 items based on the given Ordering.
    + */
    +
    +private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    --- End diff --
    
    (It'd need its own unit tests, though not sure if we'll get everything in for 2.2)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103349080
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/TopByKeyAggregator.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import scala.language.implicitConversions
    +import scala.reflect.runtime.universe.TypeTag
    +
    +import org.apache.spark.sql.{Encoder, Encoders}
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.expressions.Aggregator
    +import org.apache.spark.util.BoundedPriorityQueue
    +
    +/**
    + * Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
    + * the top `num` K2 items based on the given Ordering.
    + */
    +
    +private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    --- End diff --
    
    we may want to put this somewhere more general to be used ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73543/testReport)** for PR 17090 at commit [`832b066`](https://github.com/apache/spark/commit/832b066f490c212b5a79fd045460525afd9576b9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73787/testReport)** for PR 17090 at commit [`b0680db`](https://github.com/apache/spark/commit/b0680db96c0966ab25449d5716dac9e082db27a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    I'll merge this with master now
    Thanks @sueann  and @MLnick for feedback.  I'll prioritize helping with your work on transform, metrics, and tuning for ALS next.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    It could also be I'm overthinking things - and we can mould the `RankingEvaluator` to accept both types of input - the array version: `(Array(predictions), Array(labels))` or the "exploded" version: `(label, prediction)`.
    
    In which case it doesn't really matter what call we make here, as long as we're consistent with it later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r104036563
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -594,6 +595,95 @@ class ALSSuite
           model.setColdStartStrategy(s).transform(data)
         }
       }
    +
    +  private def getALSModel = {
    +    val spark = this.spark
    +    import spark.implicits._
    +
    +    val userFactors = Seq(
    +      (0, Array(6.0f, 4.0f)),
    +      (1, Array(3.0f, 4.0f)),
    +      (2, Array(3.0f, 6.0f))
    +    ).toDF("id", "features")
    +    val itemFactors = Seq(
    +      (3, Array(5.0f, 6.0f)),
    +      (4, Array(6.0f, 2.0f)),
    +      (5, Array(3.0f, 6.0f)),
    +      (6, Array(4.0f, 1.0f))
    +    ).toDF("id", "features")
    +    val als = new ALS().setRank(2)
    +    new ALSModel(als.uid, als.getRank, userFactors, itemFactors)
    +      .setUserCol("user")
    +      .setItemCol("item")
    +  }
    +
    +  test("recommendForAllUsers with k < num_items") {
    +    val topItems = getALSModel.recommendForAllUsers(2)
    +    assert(topItems.count() == 3)
    +    assert(topItems.columns.contains("user"))
    +
    +    val expected = Map(
    +      0 -> Array(Row(3, 54f), Row(4, 44f)),
    +      1 -> Array(Row(3, 39f), Row(5, 33f)),
    +      2 -> Array(Row(3, 51f), Row(5, 45f))
    +    )
    +    checkRecommendations(topItems, expected, "item")
    +  }
    +
    +  test("recommendForAllUsers with k = num_items") {
    +    val topItems = getALSModel.recommendForAllUsers(4)
    +    assert(topItems.count() == 3)
    +    assert(topItems.columns.contains("user"))
    +
    +    val expected = Map(
    +      0 -> Array(Row(3, 54f), Row(4, 44f), Row(5, 42f), Row(6, 28f)),
    +      1 -> Array(Row(3, 39f), Row(5, 33f), Row(4, 26f), Row(6, 16f)),
    +      2 -> Array(Row(3, 51f), Row(5, 45f), Row(4, 30f), Row(6, 18f))
    +    )
    +    checkRecommendations(topItems, expected, "item")
    +  }
    +
    +  test("recommendForAllItems with k < num_users") {
    +    val topUsers = getALSModel.recommendForAllItems(2)
    +    assert(topUsers.count() == 4)
    +    assert(topUsers.columns.contains("item"))
    +
    +    val expected = Map(
    +      3 -> Array(Row(0, 54f), Row(2, 51f)),
    +      4 -> Array(Row(0, 44f), Row(2, 30f)),
    +      5 -> Array(Row(2, 45f), Row(0, 42f)),
    +      6 -> Array(Row(0, 28f), Row(2, 18f))
    +    )
    +    checkRecommendations(topUsers, expected, "user")
    +  }
    +
    +  test("recommendForAllItems with k = num_users") {
    +    val topUsers = getALSModel.recommendForAllItems(3)
    +    assert(topUsers.count() == 4)
    +    assert(topUsers.columns.contains("item"))
    +
    +    val expected = Map(
    +      3 -> Array(Row(0, 54f), Row(2, 51f), Row(1, 39f)),
    +      4 -> Array(Row(0, 44f), Row(2, 30f), Row(1, 26f)),
    +      5 -> Array(Row(2, 45f), Row(0, 42f), Row(1, 33f)),
    +      6 -> Array(Row(0, 28f), Row(2, 18f), Row(1, 16f))
    +    )
    +    checkRecommendations(topUsers, expected, "user")
    +  }
    +
    +  private def checkRecommendations(
    +      topK: DataFrame,
    +      expected: Map[Int, Array[Row]],
    +      dstColName: String): Unit = {
    +    assert(topK.columns.contains("recommendations"))
    +    topK.collect().foreach { row =>
    +      val id = row.getInt(0)
    +      val recs = row.getAs[WrappedArray[Row]]("recommendations")
    +      assert(recs === expected(id))
    +      assert(recs(0).fieldIndex(dstColName) == 0)
    +      assert(recs(0).fieldIndex("rating") == 1)
    --- End diff --
    
    Actually nevermind. Either way is committing to an incompatible API so the name one seems preferable. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73553/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73866/testReport)** for PR 17090 at commit [`6a7e3d1`](https://github.com/apache/spark/commit/6a7e3d138b33c66644cdf68b6b20287ab0705aa6).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    I commented further on the [JIRA](https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15898855&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15898855).
    
    Sorry if my other comments here and on JIRA were unclear. But the proposed schema for input to `RankingEvaluator` is:
    
    ### Schema 1
    ```
    +------+-------+------+----------+
    |userId|movieId|rating|prediction|
    +------+-------+------+----------+
    |   230|    318|   5.0| 4.2403245|
    |   230|   3424|   4.0|      null|
    |   230|  81191|  null|  4.317455|
    +------+-------+------+----------+
    ```
    
    You will notice that `rating` and `prediction` columns can be `null`. This is by design. There are three cases shown above:
    1. 1st row indicates a (user-item) pair that occurs in *both* the ground-truth set *and* the top-k predictions;
    2. 2nd row indicates a (user-item) pair that occurs in the ground-truth set, *but not* in the top-k predictions;
    3. 3rd row indicates a (user-item) pair that occurs in the top-k predictions, *but not* in the ground-truth set.
    
    _Note_ for reference, the input to the current `mllib` `RankingMetrics` is:
    
    ### Schema 2
    ```
    RDD[(true labels array, predicted labels array)],
    i.e.
    RDD of ([318, 3424, 7139,...], [81191, 93040, 31...])
    ```
    
    (So actually neither of the above schemas are easily compatible with the return schema here - but I think it is not really necessary to match the `mllib.RankingMetrics` format)
    
    ### ALS cross-validation
    
    My proposal for fitting ALS into cross-validation is the `ALSModel.transform` will output a DF of **Schema 1** - *only* when the parameters `k` and `recommendFor` are appropriately set, and the input DF contains both `user` and `item` columns. In practice, this scenario will occur during cross-validation only. 
    
    So what I am saying is that ALS itself (not the evaluator) must know how to return the correct DataFrame output from `transform` such that it can be used in a cross-validation as input to the `RankingEvaluator`.
    
    __Concretely:__
    ```scala
    val als = new ALS().setRecommendFor("user").setK(10)
    val validator = new TrainValidationSplit()
      .setEvaluator(new RankingEvaluator().setK(10))
      .setEstimator(als)
      .setEstimatorParamMaps(...)
    val bestModel = validator.fit(ratings)
    ```
    
    So while it is complex under the hood - to users it's simply a case of setting 2 params and the rest is as normal.
    
    Now, we have the best model selected by cross-validation. We can make recommendations using these convenience methods (I think it will need a cast):
    
    ```scala
    val recommendations = bestModel.asInstanceOf[ALSModel].recommendItemsforUsers(10)
    ```
    
    Alternatively, the `transform` version looks like this:
    ```scala
    val usersDF = ...
    +------+
    |userId|
    +------+
    |     1|
    |     2|
    |     3|
    +------+
    val recommendations = bestModel.transform(usersDF)
    ```
    
    So the questions:
    1. should we support the above `transform`-based recommendations? Or only support it for cross-validation purposes as a special case?
    2. if we do, what should the output schema of the above `transform` version look like? It must certainly match the output of `recommendX` methods.
    
    The options are:
    
    (1) The schema in this PR: 
    **Pros**: as you mention above - also more "compact"
    **Cons**: doesn't match up so closely with the `transform` "cross-validation" schema above
    
    (2) The schema below. It is basically an "exploded" version of option (1)
    
    ```
    +------+-------+----------+
    |userId|movieId|prediction|
    +------+-------+----------+
    |     1|      1|       4.3|
    |     1|      5|       3.2|
    |     1|      9|       2.1|
    +------+-------+----------+
    ```
    
    **Pros***: matches more closely with the cross-validation / evaluator input format. Perhaps slightly more "dataframe-like".
    **Cons**: less compact; lose ordering?; may require more munging to save to external data stores etc. 
    
    Anyway sorry for hijacking this PR discussion - but as I think you can see, the evaluator / ALS transform interplay is a bit subtle and requires some thought to get the right approach.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r104006788
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/TopByKeyAggregatorSuite.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Dataset
    +
    +
    +class TopByKeyAggregatorSuite extends SparkFunSuite with MLlibTestSparkContext {
    +
    +  private def getTopK(k: Int): Dataset[(Int, Array[(Int, Float)])] = {
    +    val sqlContext = spark.sqlContext
    +    import sqlContext.implicits._
    +
    +    val topKAggregator = new TopByKeyAggregator[Int, Int, Float](k, Ordering.by(_._2))
    +    Seq(
    +      (0, 3, 54f),
    --- End diff --
    
    maybe a good idea to have varying # values = like maybe one with only `1` etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103350410
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/TopByKeyAggregator.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import scala.language.implicitConversions
    +import scala.reflect.runtime.universe.TypeTag
    +
    +import org.apache.spark.sql.{Encoder, Encoders}
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.expressions.Aggregator
    +import org.apache.spark.util.BoundedPriorityQueue
    +
    +/**
    + * Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
    + * the top `num` K2 items based on the given Ordering.
    + */
    +
    +private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    +  (num: Int, ord: Ordering[(K2, V)])
    +  extends Aggregator[(K1, K2, V), BoundedPriorityQueue[(K2, V)], Array[(K2, V)]] {
    +
    +  override def zero: BoundedPriorityQueue[(K2, V)] = new BoundedPriorityQueue[(K2, V)](num)(ord)
    +  override def reduce(
    --- End diff --
    
    \U0001f44d


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73623/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73623/testReport)** for PR 17090 at commit [`41a11e7`](https://github.com/apache/spark/commit/41a11e71ca5c98a561964e5b6226d1824ca8f19c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73543/testReport)** for PR 17090 at commit [`832b066`](https://github.com/apache/spark/commit/832b066f490c212b5a79fd045460525afd9576b9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103350775
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -248,18 +248,18 @@ class ALSModel private[ml] (
       @Since("1.3.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
    +    if (userFeatures != null && itemFeatures != null) {
    +      blas.sdot(rank, userFeatures.toArray, 1, itemFeatures.toArray, 1)
    --- End diff --
    
    Good point! But since I copy-pasted this block in this PR, maybe it's okay to try it out in another PR? At least with what we have here we know it's not a regression. Want to make sure we get some version of ALS recommendForAll* in 2.2. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103353184
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -285,6 +285,43 @@ class ALSModel private[ml] (
     
       @Since("1.6.0")
       override def write: MLWriter = new ALSModel.ALSModelWriter(this)
    +
    +  @Since("2.2.0")
    +  def recommendForAllUsers(num: Int): DataFrame = {
    +    recommendForAll(userFactors, itemFactors, $(userCol), num)
    +  }
    +
    +  @Since("2.2.0")
    +  def recommendForAllItems(num: Int): DataFrame = {
    +    recommendForAll(itemFactors, userFactors, $(itemCol), num)
    +  }
    +
    +  /**
    +   * Makes recommendations for all users (or items).
    +   * @param srcFactors src factors for which to generate recommendations
    +   * @param dstFactors dst factors used to make recommendations
    +   * @param srcOutputColumn name of the column for the source in the output DataFrame
    +   * @param num number of recommendations for each record
    +   * @return a DataFrame of (srcOutputColumn: Int, recommendations), where recommendations are
    +   *         stored as an array of (dstId: Int, ratingL: Double) tuples.
    +   */
    +  private def recommendForAll(
    +      srcFactors: DataFrame,
    +      dstFactors: DataFrame,
    +      srcOutputColumn: String,
    +      num: Int): DataFrame = {
    +    import srcFactors.sparkSession.implicits._
    +
    +    val ratings = srcFactors.crossJoin(dstFactors)
    +      .select(
    +        srcFactors("id").as("srcId"),
    +        dstFactors("id").as("dstId"),
    +        predict(srcFactors("features"), dstFactors("features")).as($(predictionCol)))
    +    // We'll force the IDs to be Int. Unfortunately this converts IDs to Int in the output.
    +    val topKAggregator = new TopByKeyAggregator[Int, Int, Float](num, Ordering.by(_._2))
    +    ratings.as[(Int, Int, Float)].groupByKey(_._1).agg(topKAggregator.toColumn)
    --- End diff --
    
    It'd be nice to specify field names for dstId and rating and to document the schema in the recommend methods.  That will help users extract recommendations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73540/testReport)** for PR 17090 at commit [`707bc6b`](https://github.com/apache/spark/commit/707bc6b153a7f899fbf3fe2a5675cacba1f95711).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73628/testReport)** for PR 17090 at commit [`ef93575`](https://github.com/apache/spark/commit/ef93575ff30005fc9bab29dc5d55809e5bd773f4).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    #12574 is a comprehensive solution that also intends to support cross-validation as well as recommending for a subset (or any arbitrary set) of users/items. So it solves [SPARK-10802](https://issues.apache.org/jira/browse/SPARK-13857) and [SPARK-10802](https://issues.apache.org/jira/browse/SPARK-10802) at the same time.
    
    That PR is in fully working state. I'm a little surprised to see work done on this rather than deciding on the input/output schema stuff...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r104006071
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -594,6 +595,95 @@ class ALSSuite
           model.setColdStartStrategy(s).transform(data)
         }
       }
    +
    +  private def getALSModel = {
    +    val spark = this.spark
    +    import spark.implicits._
    +
    +    val userFactors = Seq(
    +      (0, Array(6.0f, 4.0f)),
    +      (1, Array(3.0f, 4.0f)),
    +      (2, Array(3.0f, 6.0f))
    +    ).toDF("id", "features")
    +    val itemFactors = Seq(
    +      (3, Array(5.0f, 6.0f)),
    +      (4, Array(6.0f, 2.0f)),
    +      (5, Array(3.0f, 6.0f)),
    +      (6, Array(4.0f, 1.0f))
    +    ).toDF("id", "features")
    +    val als = new ALS().setRank(2)
    +    new ALSModel(als.uid, als.getRank, userFactors, itemFactors)
    +      .setUserCol("user")
    +      .setItemCol("item")
    +  }
    +
    +  test("recommendForAllUsers with k < num_items") {
    +    val topItems = getALSModel.recommendForAllUsers(2)
    +    assert(topItems.count() == 3)
    +    assert(topItems.columns.contains("user"))
    +
    +    val expected = Map(
    +      0 -> Array(Row(3, 54f), Row(4, 44f)),
    +      1 -> Array(Row(3, 39f), Row(5, 33f)),
    +      2 -> Array(Row(3, 51f), Row(5, 45f))
    +    )
    +    checkRecommendations(topItems, expected, "item")
    +  }
    +
    +  test("recommendForAllUsers with k = num_items") {
    +    val topItems = getALSModel.recommendForAllUsers(4)
    +    assert(topItems.count() == 3)
    +    assert(topItems.columns.contains("user"))
    +
    +    val expected = Map(
    +      0 -> Array(Row(3, 54f), Row(4, 44f), Row(5, 42f), Row(6, 28f)),
    +      1 -> Array(Row(3, 39f), Row(5, 33f), Row(4, 26f), Row(6, 16f)),
    +      2 -> Array(Row(3, 51f), Row(5, 45f), Row(4, 30f), Row(6, 18f))
    +    )
    +    checkRecommendations(topItems, expected, "item")
    +  }
    +
    +  test("recommendForAllItems with k < num_users") {
    +    val topUsers = getALSModel.recommendForAllItems(2)
    +    assert(topUsers.count() == 4)
    +    assert(topUsers.columns.contains("item"))
    +
    +    val expected = Map(
    +      3 -> Array(Row(0, 54f), Row(2, 51f)),
    +      4 -> Array(Row(0, 44f), Row(2, 30f)),
    +      5 -> Array(Row(2, 45f), Row(0, 42f)),
    +      6 -> Array(Row(0, 28f), Row(2, 18f))
    +    )
    +    checkRecommendations(topUsers, expected, "user")
    +  }
    +
    +  test("recommendForAllItems with k = num_users") {
    +    val topUsers = getALSModel.recommendForAllItems(3)
    +    assert(topUsers.count() == 4)
    +    assert(topUsers.columns.contains("item"))
    +
    +    val expected = Map(
    +      3 -> Array(Row(0, 54f), Row(2, 51f), Row(1, 39f)),
    +      4 -> Array(Row(0, 44f), Row(2, 30f), Row(1, 26f)),
    +      5 -> Array(Row(2, 45f), Row(0, 42f), Row(1, 33f)),
    +      6 -> Array(Row(0, 28f), Row(2, 18f), Row(1, 16f))
    +    )
    +    checkRecommendations(topUsers, expected, "user")
    +  }
    +
    +  private def checkRecommendations(
    +      topK: DataFrame,
    +      expected: Map[Int, Array[Row]],
    +      dstColName: String): Unit = {
    +    assert(topK.columns.contains("recommendations"))
    +    topK.collect().foreach { row =>
    --- End diff --
    
    It's a little strange to have all the `Row` stuff in these tests.
    
    You can do `topK.as[(Int, Seq[(Int, Float)])].collect.foreach { case (id, recs) => ...`
    
    Then adjust `expected` accordingly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    @MLnick  OK I think I misunderstood some of your comments above then.  I see the proposal in SPARK-14409 differs from this PR, so I agree it'd be nice to resolve it.  We can make changes to this PR's schema as long as it happens soon.
    
    Here are the pros of each as I see them:
    1. Nested schema (as in this PR): ```[user, Array((item, rating))]```
      * Easy to work with both nested & flattened schema (```df.select("recommendations.item")```)  (AFAIK there's no simple way to zip and nest the 2 columns when starting with the flattened schema.)
    2. Flattened schema (as in SPARK-14409): ```[user, Array(item), Array(rating)]```
      * More efficient to store in Row-based formats like Avro
    
    I'm not sure if there's a performance difference in the formats when stored in Tungsten Rows.  I think not, but that'd be good to know.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103539599
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -547,6 +550,45 @@ class ALSSuite
           ALS.train(ratings)
         }
       }
    +
    +  test("recommendForAllUsers") {
    +    val numUsers = 20
    +    val numItems = 40
    +    val numRecs = 5
    +    val (training, test) = genExplicitTestData(numUsers, numItems, rank = 2, noiseStd = 0.01)
    +    val topItems =
    +      testALS(training, test, maxIter = 4, rank = 2, regParam = 0.01, targetRMSE = 0.03)
    --- End diff --
    
    Seems wasteful to compute and check a model here, and it doesn't really test that the predictions are what we expect them to be.
    
    We can construct an `ALSModel` with known factors and check the predictions are as expected (it's just a matrix multiply). See `MatrixFactorizationModelSuite` and tests in #12574 (based on those) for example.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73866/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r104007245
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/TopByKeyAggregatorSuite.scala ---
    @@ -0,0 +1,81 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.Dataset
    +
    +
    +class TopByKeyAggregatorSuite extends SparkFunSuite with MLlibTestSparkContext {
    +
    +  private def getTopK(k: Int): Dataset[(Int, Array[(Int, Float)])] = {
    +    val sqlContext = spark.sqlContext
    +    import sqlContext.implicits._
    +
    +    val topKAggregator = new TopByKeyAggregator[Int, Int, Float](k, Ordering.by(_._2))
    +    Seq(
    +      (0, 3, 54f),
    +      (0, 4, 44f),
    +      (0, 5, 42f),
    +      (0, 6, 28f),
    +      (1, 3, 39f),
    +      (1, 4, 26f),
    +      (1, 5, 33f),
    +      (1, 6, 16f),
    +      (2, 3, 51f),
    +      (2, 4, 30f),
    +      (2, 5, 45f),
    +      (2, 6, 18f)
    +    ).toDS().groupByKey(_._1).agg(topKAggregator.toColumn)
    +  }
    +
    +  test("topByKey with k < #items") {
    +    val topK = getTopK(2)
    +    assert(topK.count() === 3)
    +
    +    val expected = Map(
    +      0 -> Array((3, 54f), (4, 44f)),
    +      1 -> Array((3, 39f), (5, 33f)),
    +      2 -> Array((3, 51f), (5, 45f))
    +    )
    +    checkTopK(topK, expected)
    +  }
    +
    +  test("topByKey with k > #items") {
    +    val topK = getTopK(5)
    +    assert(topK.count() === 3)
    +
    +    val expected = Map(
    +      0 -> Array((3, 54f), (4, 44f), (5, 42f), (6, 28f)),
    +      1 -> Array((3, 39f), (5, 33f), (4, 26f), (6, 16f)),
    +      2 -> Array((3, 51f), (5, 45f), (4, 30f), (6, 18f))
    +    )
    +    checkTopK(topK, expected)
    +  }
    +
    +  private def checkTopK(
    +      topK: Dataset[(Int, Array[(Int, Float)])],
    +      expected: Map[Int, Array[(Int, Float)]]): Unit = {
    +    topK.collect().foreach { record =>
    --- End diff --
    
    slightly prefer `foreach { case (id, recs) => ...` 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103366357
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -285,6 +285,43 @@ class ALSModel private[ml] (
     
       @Since("1.6.0")
       override def write: MLWriter = new ALSModel.ALSModelWriter(this)
    +
    +  @Since("2.2.0")
    +  def recommendForAllUsers(num: Int): DataFrame = {
    +    recommendForAll(userFactors, itemFactors, $(userCol), num)
    +  }
    +
    +  @Since("2.2.0")
    +  def recommendForAllItems(num: Int): DataFrame = {
    +    recommendForAll(itemFactors, userFactors, $(itemCol), num)
    +  }
    +
    +  /**
    +   * Makes recommendations for all users (or items).
    +   * @param srcFactors src factors for which to generate recommendations
    +   * @param dstFactors dst factors used to make recommendations
    +   * @param srcOutputColumn name of the column for the source in the output DataFrame
    +   * @param num number of recommendations for each record
    +   * @return a DataFrame of (srcOutputColumn: Int, recommendations), where recommendations are
    +   *         stored as an array of (dstId: Int, ratingL: Double) tuples.
    +   */
    +  private def recommendForAll(
    +      srcFactors: DataFrame,
    +      dstFactors: DataFrame,
    +      srcOutputColumn: String,
    +      num: Int): DataFrame = {
    +    import srcFactors.sparkSession.implicits._
    +
    +    val ratings = srcFactors.crossJoin(dstFactors)
    +      .select(
    +        srcFactors("id").as("srcId"),
    +        dstFactors("id").as("dstId"),
    +        predict(srcFactors("features"), dstFactors("features")).as($(predictionCol)))
    +    // We'll force the IDs to be Int. Unfortunately this converts IDs to Int in the output.
    +    val topKAggregator = new TopByKeyAggregator[Int, Int, Float](num, Ordering.by(_._2))
    +    ratings.as[(Int, Int, Float)].groupByKey(_._1).agg(topKAggregator.toColumn)
    --- End diff --
    
    I'm not sure what a good way to do this is :-/ Ways I can think of but haven't succeeded in:
    1/ change the schema of the entire DataFrame
    2/ map over the rows in the DataFrame {
      map over the items in the array {
        convert from tuple (really a Row) to a Row with a different schema
      }
    }
    
    I tried using RowEncoder in either case, but the types haven't quite worked out. Any ideas?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73787/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103352432
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -285,6 +285,43 @@ class ALSModel private[ml] (
     
       @Since("1.6.0")
       override def write: MLWriter = new ALSModel.ALSModelWriter(this)
    +
    +  @Since("2.2.0")
    +  def recommendForAllUsers(num: Int): DataFrame = {
    +    recommendForAll(userFactors, itemFactors, $(userCol), num)
    +  }
    +
    +  @Since("2.2.0")
    +  def recommendForAllItems(num: Int): DataFrame = {
    +    recommendForAll(itemFactors, userFactors, $(itemCol), num)
    +  }
    +
    +  /**
    +   * Makes recommendations for all users (or items).
    +   * @param srcFactors src factors for which to generate recommendations
    +   * @param dstFactors dst factors used to make recommendations
    +   * @param srcOutputColumn name of the column for the source in the output DataFrame
    +   * @param num number of recommendations for each record
    +   * @return a DataFrame of (srcOutputColumn: Int, recommendations), where recommendations are
    +   *         stored as an array of (dstId: Int, ratingL: Double) tuples.
    --- End diff --
    
    ratingL -> rating


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Good point for copying some detail to JIRA, will do that.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103350750
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -285,6 +285,43 @@ class ALSModel private[ml] (
     
       @Since("1.6.0")
       override def write: MLWriter = new ALSModel.ALSModelWriter(this)
    +
    +  @Since("2.2.0")
    +  def recommendForAllUsers(num: Int): DataFrame = {
    --- End diff --
    
    Maybe rename to numItems (and numUsers in the other method)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73624/testReport)** for PR 17090 at commit [`4ac586a`](https://github.com/apache/spark/commit/4ac586aef019a22e301a0dfe9cea4d08abeec91d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103353799
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/TopByKeyAggregator.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import scala.language.implicitConversions
    +import scala.reflect.runtime.universe.TypeTag
    +
    +import org.apache.spark.sql.{Encoder, Encoders}
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.expressions.Aggregator
    +import org.apache.spark.util.BoundedPriorityQueue
    +
    +/**
    + * Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
    + * the top `num` K2 items based on the given Ordering.
    + */
    +
    +private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    +  (num: Int, ord: Ordering[(K2, V)])
    +  extends Aggregator[(K1, K2, V), BoundedPriorityQueue[(K2, V)], Array[(K2, V)]] {
    +
    +  override def zero: BoundedPriorityQueue[(K2, V)] = new BoundedPriorityQueue[(K2, V)](num)(ord)
    +  override def reduce(
    +    q: BoundedPriorityQueue[(K2, V)],
    +    a: (K1, K2, V)): BoundedPriorityQueue[(K2, V)] = q += {(a._2, a._3)}
    +  override def merge(
    +      q1: BoundedPriorityQueue[(K2, V)],
    +      q2: BoundedPriorityQueue[(K2, V)]): BoundedPriorityQueue[(K2, V)] = q1 ++= q2
    +  override def finish(r: BoundedPriorityQueue[(K2, V)]): Array[(K2, V)] =
    +    r.toArray.sorted(ord.reverse)
    +  override def bufferEncoder: Encoder[BoundedPriorityQueue[(K2, V)]] =
    +    Encoders.kryo[BoundedPriorityQueue[(K2, V)]]
    +  override def outputEncoder: Encoder[Array[(K2, V)]] = ExpressionEncoder[Array[(K2, V)]]
    --- End diff --
    
    IntelliJ style complaint: include "()" at end


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73628/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73553/testReport)** for PR 17090 at commit [`ebd2604`](https://github.com/apache/spark/commit/ebd26043fc9432d41b83612dfefcc27229d318cb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103349429
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/TopByKeyAggregator.scala ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import scala.language.implicitConversions
    +import scala.reflect.runtime.universe.TypeTag
    +
    +import org.apache.spark.sql.{Encoder, Encoders}
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.expressions.Aggregator
    +import org.apache.spark.util.BoundedPriorityQueue
    +
    +/**
    + * Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
    + * the top `num` K2 items based on the given Ordering.
    + */
    +
    +private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    +  (num: Int, ord: Ordering[(K2, V)])
    +  extends Aggregator[(K1, K2, V), BoundedPriorityQueue[(K2, V)], Array[(K2, V)]] {
    +
    +  override def zero: BoundedPriorityQueue[(K2, V)] = new BoundedPriorityQueue[(K2, V)](num)(ord)
    +  override def reduce(
    --- End diff --
    
    I think you need to throw some spaces and braces in here to make it a bit more readable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103351299
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -248,18 +248,18 @@ class ALSModel private[ml] (
       @Since("1.3.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
    --- End diff --
    
    You could rename userFeatures, itemFeatures to be featuresA, featuresB or something to make it clear that there is no ordering here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103538498
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -248,18 +248,18 @@ class ALSModel private[ml] (
       @Since("1.3.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
    --- End diff --
    
    They are switched in the case of `recommendForAllItems` so is that not less clear? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    It's a good point about making an implicit decision.  We could deprecate these methods in favor of transform-based ones in the future---we have done this in the past---but it does push the long-term decision in a clear direction.
    
    My hesitation about transform is not just about the schema.  It's also because I'm still unclear how we could plug ALS into tuning without having tuning specifically understand ALS (knowing about users, items, etc.).  I'll add my thoughts on the JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    For performance tests, I've been using the MovieLens `ml-latest` dataset [here](https://grouplens.org/datasets/movielens/). It has `24,404,096` ratings with `259,137` users and `39,443` movies.
    
    So it's not enormous but "recommend all" does a lot of work - generating `1,631,206,099` predicted ratings raw before the `top-k`.
    
    Some quick tests for the existing `recommendProductsForUsers` gives `306 sec`.
    ```
    scala> spark.time { oldModel.recommendProductsForUsers(k).count }
    Time taken: 306512 ms
    res11: Long = 259137
    ```
    
    As part of my performance testing I've tried a few approaches roughly similar to this PR, but using `Window` and `filter` rather than this top-k aggregator (which is a neat idea). 
    
    At first I thought this PR was really good: 
    ```
    scala> spark.time { newModel.recommendForAllUsers(k).count }
    Time taken: 151504 ms
    res3: Long = 259137
    ```
    
    `151 sec` seems fast!
    
    But then I tried this: 
    ```
    scala> spark.time { newModel.recommendForAllUsers(k).show }
    +------+--------------------+
    |userId|     recommendations|
    +------+--------------------+
    | 35982|[[131382,15.53116...|
    | 67782|[[131382,29.72169...|
    | 82672|[[132954,12.19152...|
    |155042|[[148954,16.09084...|
    |167532|[[118942,13.94282...|
    |168802|[[27212,11.881494...|
    |216112|[[109159,25.46359...|
    |243392|[[153010,9.85302]...|
    |255132|[[131382,15.50626...|
    |255362|[[131382,10.08476...|
    | 17389|[[152711,16.09958...|
    |120899|[[156956,12.61003...|
    |213089|[[82055,13.293286...|
    |253769|[[152711,16.57459...|
    |258129|[[152711,22.50499...|
    | 24347|[[152711,12.31282...|
    | 35947|[[153184,11.04110...|
    |103357|[[132954,13.26898...|
    |130557|[[118942,14.00168...|
    |156017|[[153010,12.24449...|
    +------+--------------------+
    only showing top 20 rows
    
    Time taken: 672524 ms
    ```
    
    `672 sec`, over 2x slower than `mllib` impl.
    
    Not sure why `count` is fast relative to `show` (maybe Spark SQL is not doing all the actual compute, while for `show` it does need to?).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    The performance of #12574 is not better than the existing `mllib` recommend-all - since it wraps the functionality it's roughly on par.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103538921
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/TopByKeyAggregator.scala ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.recommendation
    +
    +import scala.language.implicitConversions
    +import scala.reflect.runtime.universe.TypeTag
    +
    +import org.apache.spark.sql.{Encoder, Encoders}
    +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    +import org.apache.spark.sql.expressions.Aggregator
    +import org.apache.spark.util.BoundedPriorityQueue
    +
    +
    +/**
    + * Works on rows of the form (K1, K2, V) where K1 & K2 are IDs and V is the score value. Finds
    + * the top `num` K2 items based on the given Ordering.
    + */
    +private[recommendation] class TopByKeyAggregator[K1: TypeTag, K2: TypeTag, V: TypeTag]
    --- End diff --
    
    I'd think we should have at least some basic tests for this - see `MLPairRDDFunctionsSuite` for example


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    I should note that I've found the performance of "recommend all" to be very dependent on number of partitions since it controls the memory consumption per task (which can easily explode in the blocked `mllib` version) vs the CPU utilization & amount of shuffle data.
    
    For example, the default `mllib` results above use `192*192` = `36,864` partitions (due to cross-join). So it does prevent dying due to exploding memory & GC but is slower than using fewer partitions. However, too few partitions and it dies.
    
    I actually just realised that the defaults for `mllib` for user/item blocks - which in turn controls the partitions for the factors - is `defaultParallelism` (192 for my setup), while for `ml` it is `10`. Hence we need to create a like-for-like comparison.
    
    (Side note - it's not ideal actually that the num blocks drives the recommend-all partitions - because the optimal settings for training ALS are unlikely to be optimal for batch recommend-all prediction).
    
    Anyway some results of quick tests on my setup:
    
    Firstly, to match `mllib` defaults:
    
    `mllib` with 192*192: `323 sec`
    
    ```
    scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
    Time taken: 323367 ms
    ```
    
    `ml` with 192*192: `427 sec`
    
    ```
    scala> val newModel = newAls.setNumUserBlocks(192).setNumItemBlocks(192).fit(ratings)
    scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
    Time taken: 427174 ms
    ```
    
    So this PR is 30% slower - which is actually pretty decent given it's not using blocked BLAS operators.
    
    *Note* I didn't use netlib native BLAS, which could make a large difference when using level 3 BLAS in the blocked `mllib` version.
    
    Secondly, to match `ml` defaults:
    
    `mllib` with 10*10: `1654 sec`
    ```
    scala> val oldModel = OldALS.train(oldRatings, rank, iter, lambda, 10)
    scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
    Time taken: 1654951 ms
    ```
    
    `ml` with 10*10: `438 sec`
    ```
    scala> val newModel = newAls.fit(ratings)
    scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
    Time taken: 438328 ms
    ```
    
    In this case, the `mllib` version blows up with memory & GC dominating runtime, and this PR is over 3x faster (though it varies a lot: 600 sec above, 438 sec here, etc).
    
    Finally, middle of the road case:
    
    `mllib` with 96*96: `175 sec`
    ```
    scala> spark.time { oldModel.recommendProductsForUsers(k).foreach(_ => Unit)  }
    Time taken: 175880 ms
    ```
    
    `ml` with 96*96: `181 sec`
    ```
    scala> spark.time { newModel.recommendForAllUsers(k).foreach(_ => Unit) }
    Time taken: 181494 ms
    ```
    
    So a few % slower. Again pretty good actually considering it's not a blocked implementation. Still room to be optimized.
    
    After running these I tested against a blocked version using `DataFrame` (to more or less match the current `mllib` version) and it's much faster in the `192*192` case, a bit slower in `96*96` case and also blows up in the `10*10` case. Again, really dependent on partitioning...
    
    So the performance here is not too bad. The positive is it should avoid completely exploding. As I mentioned above I tried a similar DataFrame-based version using `Window` & `filter` and performance was terrible. It will be interesting to see if native BLAS adds anything.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    @hhbyyh  This is different from https://github.com/apache/spark/pull/12574 since it sidesteps the ongoing design discussions about input and output schema.  Eventually, I'd like us to proceed with https://github.com/apache/spark/pull/12574 but only after we've figured out the right workflow for using these methods in a Pipeline with tuning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Finally, I've done some work related to [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and have a potential solution that seems to be pretty good. In this case it should be more than 3x faster than the best runtime here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103349590
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -248,18 +248,18 @@ class ALSModel private[ml] (
       @Since("1.3.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
    +    if (userFeatures != null && itemFeatures != null) {
    +      blas.sdot(rank, userFeatures.toArray, 1, itemFeatures.toArray, 1)
    --- End diff --
    
    I wonder how the overhead of converting to an array compares with the efficiency of calling sdot -- could be faster to just do the Seqs by hand? is it possible to operate on something besides Seq?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73787/testReport)** for PR 17090 at commit [`b0680db`](https://github.com/apache/spark/commit/b0680db96c0966ab25449d5716dac9e082db27a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r104013729
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -594,6 +595,95 @@ class ALSSuite
           model.setColdStartStrategy(s).transform(data)
         }
       }
    +
    +  private def getALSModel = {
    +    val spark = this.spark
    +    import spark.implicits._
    +
    +    val userFactors = Seq(
    +      (0, Array(6.0f, 4.0f)),
    +      (1, Array(3.0f, 4.0f)),
    +      (2, Array(3.0f, 6.0f))
    +    ).toDF("id", "features")
    +    val itemFactors = Seq(
    +      (3, Array(5.0f, 6.0f)),
    +      (4, Array(6.0f, 2.0f)),
    +      (5, Array(3.0f, 6.0f)),
    +      (6, Array(4.0f, 1.0f))
    +    ).toDF("id", "features")
    +    val als = new ALS().setRank(2)
    +    new ALSModel(als.uid, als.getRank, userFactors, itemFactors)
    +      .setUserCol("user")
    +      .setItemCol("item")
    +  }
    +
    +  test("recommendForAllUsers with k < num_items") {
    +    val topItems = getALSModel.recommendForAllUsers(2)
    +    assert(topItems.count() == 3)
    +    assert(topItems.columns.contains("user"))
    +
    +    val expected = Map(
    +      0 -> Array(Row(3, 54f), Row(4, 44f)),
    +      1 -> Array(Row(3, 39f), Row(5, 33f)),
    +      2 -> Array(Row(3, 51f), Row(5, 45f))
    +    )
    +    checkRecommendations(topItems, expected, "item")
    +  }
    +
    +  test("recommendForAllUsers with k = num_items") {
    +    val topItems = getALSModel.recommendForAllUsers(4)
    +    assert(topItems.count() == 3)
    +    assert(topItems.columns.contains("user"))
    +
    +    val expected = Map(
    +      0 -> Array(Row(3, 54f), Row(4, 44f), Row(5, 42f), Row(6, 28f)),
    +      1 -> Array(Row(3, 39f), Row(5, 33f), Row(4, 26f), Row(6, 16f)),
    +      2 -> Array(Row(3, 51f), Row(5, 45f), Row(4, 30f), Row(6, 18f))
    +    )
    +    checkRecommendations(topItems, expected, "item")
    +  }
    +
    +  test("recommendForAllItems with k < num_users") {
    +    val topUsers = getALSModel.recommendForAllItems(2)
    +    assert(topUsers.count() == 4)
    +    assert(topUsers.columns.contains("item"))
    +
    +    val expected = Map(
    +      3 -> Array(Row(0, 54f), Row(2, 51f)),
    +      4 -> Array(Row(0, 44f), Row(2, 30f)),
    +      5 -> Array(Row(2, 45f), Row(0, 42f)),
    +      6 -> Array(Row(0, 28f), Row(2, 18f))
    +    )
    +    checkRecommendations(topUsers, expected, "user")
    +  }
    +
    +  test("recommendForAllItems with k = num_users") {
    +    val topUsers = getALSModel.recommendForAllItems(3)
    +    assert(topUsers.count() == 4)
    +    assert(topUsers.columns.contains("item"))
    +
    +    val expected = Map(
    +      3 -> Array(Row(0, 54f), Row(2, 51f), Row(1, 39f)),
    +      4 -> Array(Row(0, 44f), Row(2, 30f), Row(1, 26f)),
    +      5 -> Array(Row(2, 45f), Row(0, 42f), Row(1, 33f)),
    +      6 -> Array(Row(0, 28f), Row(2, 18f), Row(1, 16f))
    +    )
    +    checkRecommendations(topUsers, expected, "user")
    +  }
    +
    +  private def checkRecommendations(
    +      topK: DataFrame,
    +      expected: Map[Int, Array[Row]],
    +      dstColName: String): Unit = {
    +    assert(topK.columns.contains("recommendations"))
    +    topK.collect().foreach { row =>
    +      val id = row.getInt(0)
    +      val recs = row.getAs[WrappedArray[Row]]("recommendations")
    +      assert(recs === expected(id))
    +      assert(recs(0).fieldIndex(dstColName) == 0)
    +      assert(recs(0).fieldIndex("rating") == 1)
    --- End diff --
    
    Should we have this requirement in the output (vs. the unnamed tuples)? It does cost more to support named fields (we have to call .rdd on the resulting dataframe in the process of imposing a new schema) and if we decide to remove it later it'll break any usage of the named fields. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103366071
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -248,18 +248,18 @@ class ALSModel private[ml] (
       @Since("1.3.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
    --- End diff --
    
    I think it's actually clearer to keep the names as it "instantiates" the usage to the reader.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    @jkbradley do we propose to add further methods to support recommending for all users (or items) in an input DF? like `recommendForAllUsers(dataset: DataFrame, num: Int)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17090


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73543/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103431594
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
    @@ -547,6 +550,45 @@ class ALSSuite
           ALS.train(ratings)
         }
       }
    +
    +  test("recommendForAllUsers") {
    +    val numUsers = 20
    +    val numItems = 40
    +    val numRecs = 5
    +    val (training, test) = genExplicitTestData(numUsers, numItems, rank = 2, noiseStd = 0.01)
    +    val topItems =
    +      testALS(training, test, maxIter = 4, rank = 2, regParam = 0.01, targetRMSE = 0.03)
    +        .recommendForAllUsers(numRecs)
    +
    +    assert(topItems.count() == numUsers)
    +    assert(topItems.columns.contains("user"))
    +    checkRecommendationOrdering(topItems, numRecs)
    +  }
    +
    +  test("recommendForAllItems") {
    +    val numUsers = 20
    +    val numItems = 40
    +    val numRecs = 5
    +    val (training, test) = genExplicitTestData(numUsers, numItems, rank = 2, noiseStd = 0.01)
    +    val topUsers =
    +      testALS(training, test, maxIter = 4, rank = 2, regParam = 0.01, targetRMSE = 0.03)
    +        .recommendForAllItems(numRecs)
    +
    +    assert(topUsers.count() == numItems)
    +    assert(topUsers.columns.contains("item"))
    +    checkRecommendationOrdering(topUsers, numRecs)
    +  }
    +
    +  private def checkRecommendationOrdering(topK: DataFrame, k: Int): Unit = {
    +    assert(topK.columns.contains("recommendations"))
    +    topK.select("recommendations").collect().foreach(
    --- End diff --
    
    Purely a style suggestion but you can get rid of one set of parens:
    
    ```
    ...foreach { row =>
      ...
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Thanks @MLnick for the explanation.  This is what I'd understood from your similar description on the JIRA, but definitely more in-depth.  (It might be good to copy to JIRA, or even a design doc at this point.)
    
    As I'd said, I haven't done a literature review on this, so it's hard for me to judge what schema evaluators should be given.  I see some implicit decisions, such as evaluators using implicit ratings (using rows missing either a label or a prediction) and us not computing predictions for all (user,item) pairs with labels.
    
    However, assuming the schema you've selected is best for evaluation, then I think this highlights 2 distinct needs for top K: (a) a user-friendly API (this PR) and (b) an evaluator-friendly API (your design).  For (a), many users have requested recommendForAll* methods matching the RDD-based equivalents, and this schema provides top K recommendations in an analogous and friendly schema.  If evaluator needs a less user-friendly schema, that's OK, but then I think it should be considered an internal/dev schema which can differ from the user-friendly version.
    
    What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103396443
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -248,18 +248,18 @@ class ALSModel private[ml] (
       @Since("1.3.0")
       def setPredictionCol(value: String): this.type = set(predictionCol, value)
     
    +  private val predict = udf { (userFeatures: Seq[Float], itemFeatures: Seq[Float]) =>
    --- End diff --
    
    Fair enough, though the inputs are switched in some uses


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    @jkbradley I've put my updated proposal for ranking evaluation [on SPARK-14409 comments](https://issues.apache.org/jira/browse/SPARK-14409?focusedCommentId=15896933&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15896933).
    
    It's different from this schema (and from the `mllib` recommend all return types). I've detailed the reasoning in the JIRA.
    
    I would have preferred a more explicit discussion & decision on the schema before merging this, but now that it's merged we should at least consider how (or if) to deal with the schema difference assuming the above proposal proceeds.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    I'd been following the long discussions about a transform-based solution, but those had not seemed to have converged to a clear design.  If you feel they have in your PR, then I'll spend some time tomorrow going through your PR to catch up.  (Keep in mind that the code freeze is scheduled to happen in a day or two: http://spark.apache.org/versioning-policy.html )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by sethah <gi...@git.apache.org>.
Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    cc @MLnick


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    As for the API - I'm ok with having the "user-facing" version differ from the `transform` version. Though it may lead to some confusion. In this case, it's probably best to have `transform` only work for the cross-validation / evaluator setting and to make it clear in future how it all works and fits together, with appropriate doc, user guide and examples.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103397271
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -285,6 +288,57 @@ class ALSModel private[ml] (
     
       @Since("1.6.0")
       override def write: MLWriter = new ALSModel.ALSModelWriter(this)
    +
    +  /**
    +   * Returns top `numItems` items recommended for each user, for all users.
    +   * @param numItems max number of recommendations for each user
    +   * @return a DataFrame of (userCol: Int, recommendations), where recommendations are
    +   *         stored as an array of (itemId: Int, rating: Double) tuples.
    +   */
    +  @Since("2.2.0")
    +  def recommendForAllUsers(numItems: Int): DataFrame = {
    +    recommendForAll(userFactors, itemFactors, $(userCol), $(itemCol), numItems)
    +  }
    +
    +  /**
    +   * Returns top `numUsers` users recommended for each item, for all items.
    +   * @param numUsers max number of recommendations for each item
    +   * @return a DataFrame of (itemCol: Int, recommendations), where recommendations are
    +   *         stored as an array of (userId: Int, rating: Double) tuples.
    +   */
    +  @Since("2.2.0")
    +  def recommendForAllItems(numUsers: Int): DataFrame = {
    +    recommendForAll(itemFactors, userFactors, $(itemCol), $(userCol), numUsers)
    +  }
    +
    +  /**
    +   * Makes recommendations for all users (or items).
    +   * @param srcFactors src factors for which to generate recommendations
    +   * @param dstFactors dst factors used to make recommendations
    +   * @param srcOutputColumn name of the column for the source in the output DataFrame
    +   * @param dstOutputColumn name of the column for the destination in the output DataFrame
    +   * @param num max number of recommendations for each record
    +   * @return a DataFrame of (srcOutputColumn: Int, recommendations), where recommendations are
    +   *         stored as an array of (dstId: Int, rating: Double) tuples.
    +   */
    +  private def recommendForAll(
    +      srcFactors: DataFrame,
    +      dstFactors: DataFrame,
    +      srcOutputColumn: String,
    +      dstOutputColumn: String,
    --- End diff --
    
    Is dstOutputColumn from WIP specifying the schema?  I'll get back soon about ways to specify the schema


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73866/testReport)** for PR 17090 at commit [`6a7e3d1`](https://github.com/apache/spark/commit/6a7e3d138b33c66644cdf68b6b20287ab0705aa6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73624/testReport)** for PR 17090 at commit [`4ac586a`](https://github.com/apache/spark/commit/4ac586aef019a22e301a0dfe9cea4d08abeec91d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73553/testReport)** for PR 17090 at commit [`ebd2604`](https://github.com/apache/spark/commit/ebd26043fc9432d41b83612dfefcc27229d318cb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by sueann <gi...@git.apache.org>.
Github user sueann commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    The output in https://github.com/apache/spark/pull/12574/ looks like a DataFrame with Row(srcCol: Int, "recommendations": Array[(Int, Float)]) so I think this PR as is matches the output type - @MLnick to confirm. Will make the suggested changes from everyone. Thanks a lot for the performance testing @MLnick! Really neat to see.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73624/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    Isn't deciding on the output schema for these methods essentially the same as deciding on transform semantics in #12574 (apart from the issue of how, or if, to have transform generate the "ground truth" items)?
    
    So in a way I'm not certain this "circumvents" the discussion around transform semantics / fitting into cross-validation & pipelines but rather makes an implicit decision on it. If this is set here, it would be rather awkward to have any future `transform` based recommend all functionality with a different output schema.
    
    My default choice in #12574 was to match the form of the existing `mllib` methods. That is the same as here, and matches the format for the existing `RankingMetrics` in `mllib`. So from that perspective it is the "easiest" choice.
    
    It's not necessarily the "best" choice - I go into the other option in detail on #12574 and the related JIRA. Ideally it should match up with the expected input schema for a new `RankingEvaluator` (not strictly necessary but definitely preferable).
    
    My concern here is that we make a quick decision implicitly due to time constraints and are stuck with it down the line as it is exposed in the public API. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #17090: [Spark-19535][ML] RecommendForAllUsers RecommendF...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17090#discussion_r103354132
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -285,6 +286,55 @@ class ALSModel private[ml] (
     
       @Since("1.6.0")
       override def write: MLWriter = new ALSModel.ALSModelWriter(this)
    +
    +  /**
    +   * Returns top `num` items recommended for each user, for all users.
    +   * @param num number of recommendations for each user
    --- End diff --
    
    number -> max number


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    @MLnick Thanks for showing those comparison numbers.  If your implementation is faster, then I'm happy going with it.  I do wonder if we might hit scalability issues with RDDs which we would not hit with DataFrames, so it'd be worth revisiting a DF-based implementation later on.
    
    In terms of the API, my main worry about https://github.com/apache/spark/pull/12574 is that I haven't seen a full design of how ALS would be plugged into cross validator.  I still don't see how CV could handle ALS unless we specialized it for recommendation.  It was this uncertainty which made me comment on https://issues.apache.org/jira/browse/SPARK-13857 to recommend we go ahead and merge basic recommendAll methods, while continuing to figure out a good design for tuning.
    
    Feel free to push back, but I would really like to see a sketch of how ALS could plug into tuning.  I haven't spent the time to do a literature review on how tuning is generally done for recommendation, especially on the best ways to split the data into folds.
    
    > further methods to support recommending for all users (or items) in an input DF? like recommendForAllUsers(dataset: DataFrame, num: Int)
    
    I do think this sounds useful, but I'm focused on feature parity w.r.t. the RDD-based API right now.  It'd be nice to add later, though that could be via your proposed transform-based API.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    @MLnick Thanks *a lot* for the detailed tests!  I really appreciate it.  In this case, are you OK with the approach in the current PR (pending reviews)?
    
    One thing we should confirm is what you think of the input/output schema for these methods.  It'd be nice to have the schema match what you think best for the transform-based recommendAll.
    
    Re: design: I saw the discussions in https://issues.apache.org/jira/browse/SPARK-13857 but I didn't see a clear statement of pros & cons leading to a design decision.  I'm sorry about not having time to get involved with the discussions, but I'd like to help out as useful in the coming months.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17090: [Spark-19535][ML] RecommendForAllUsers RecommendForAllIt...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/17090
  
    **[Test build #73623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73623/testReport)** for PR 17090 at commit [`41a11e7`](https://github.com/apache/spark/commit/41a11e71ca5c98a561964e5b6226d1824ca8f19c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org