Posted to reviews@spark.apache.org by MLnick <gi...@git.apache.org> on 2017/07/27 08:58:45 UTC

[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/18748

    [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel

    This PR adds methods `recommendForUserSubset` and `recommendForItemSubset` to `ALSModel`. These allow recommending for a specified set of user / item ids rather than for every user / item (as in the `recommendForAllX` methods).
    
    The subset methods take a `DataFrame` as input, with the ids in the column named by the `userCol` or `itemCol` param. The model generates recommendations for each _unique_ id in this input `DataFrame`.
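    
    As a rough illustration of the behaviour described above (not the Spark implementation), the sketch below uses plain Scala maps in place of the factor DataFrames. `SubsetRecommendSketch` and `topItemsForUsers` are hypothetical names introduced here, and scores are assumed to be dot products of user and item factor vectors.
    
    ```scala
    // Plain-Scala stand-in for subset recommendation; hypothetical names, not Spark API.
    object SubsetRecommendSketch {
      type Factors = Map[Int, Array[Float]]

      // Dot product of two factor vectors of equal length.
      def dot(a: Array[Float], b: Array[Float]): Float =
        a.zip(b).map { case (x, y) => x * y }.sum

      // Top `numItems` (itemId, score) pairs for each unique requested user id.
      // Ids that the model does not know produce no entry at all.
      def topItemsForUsers(
          requested: Seq[Int],
          userFactors: Factors,
          itemFactors: Factors,
          numItems: Int): Map[Int, Seq[(Int, Float)]] =
        requested.distinct.flatMap { user =>
          userFactors.get(user).map { uf =>
            val scored = itemFactors.toSeq.map { case (item, f) => (item, dot(uf, f)) }
            // Highest score first, then keep at most `numItems` entries.
            user -> scored.sortWith(_._2 > _._2).take(numItems)
          }
        }.toMap

      def main(args: Array[String]): Unit = {
        val users: Factors = Map(0 -> Array(1f, 0f), 1 -> Array(0f, 1f))
        val items: Factors =
          Map(10 -> Array(2f, 0f), 11 -> Array(0f, 3f), 12 -> Array(1f, 1f))
        // User 0 is requested twice and user 9 is unknown to the model.
        println(topItemsForUsers(Seq(0, 0, 9), users, items, 2))
      }
    }
    ```
    
    In this toy run only user 0 gets a result, once, even though the id was passed twice.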
    
    ## How was this patch tested?
    New unit tests in `ALSSuite`


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark als-recommend-df

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18748
    
----
commit 860bc2ce5f290a042756d2569eb215eee6a1fdad
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-03-16T07:07:06Z

    wip

commit 8cd9edd5e2440da15f677828cda5207e6a40be31
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-04T07:40:22Z

    further wip

commit 76fb332aa5e8483590ebb4305901f5c3e5c73c15
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-09T09:19:38Z

    Update doc

commit 6539d294c5dac499d106f7346f496dac8fee24e8
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-09T09:20:55Z

    Update doc

commit c723dff8a9f125ce4d69574f47c74aaf0df7a9da
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-09T09:23:20Z

    Update doc

commit 0004d1c9ea5074965d234fa7833450de3ffa871b
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-10T09:15:35Z

    wip on tests

commit 53229a1abc860aa8fb3c0d933fdbcef4d47f0508
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-05-12T10:42:23Z

    Clean up docs and further tests

commit 5a8c4216ce636dea3ba67baa9b169db7486f37f2
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-07-27T08:28:11Z

    Explicitly handle duplicate ids with distinct. Update tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #79998 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79998/testReport)** for PR 18748 at commit [`4bd91f1`](https://github.com/apache/spark/commit/4bd91f12e1b15657d92fea6d7b91dae2e6e68c29).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #79998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79998/testReport)** for PR 18748 at commit [`4bd91f1`](https://github.com/apache/spark/commit/4bd91f12e1b15657d92fea6d7b91dae2e6e68c29).



[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18748#discussion_r137815796
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -356,6 +371,40 @@ class ALSModel private[ml] (
       }
     
       /**
    +   * Returns top `numUsers` users recommended for each item id in the input data set. Note that if
    +   * there are duplicate ids in the input dataset, only one set of recommendations per unique id
    +   * will be returned.
    +   * @param dataset a Dataset containing a column of item ids. The column name must match `itemCol`.
    +   * @param numUsers max number of recommendations for each item.
    +   * @return a DataFrame of (itemCol: Int, recommendations), where recommendations are
    +   *         stored as an array of (userCol: Int, rating: Float) Rows.
    +   */
    +  @Since("2.3.0")
    +  def recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame = {
    +    val srcFactorSubset = getSourceFactorSubset(dataset, itemFactors, $(itemCol))
    +    recommendForAll(srcFactorSubset, userFactors, $(itemCol), $(userCol), numUsers)
    +  }
    +
    +  /**
    +   * Returns a subset of a factor DataFrame limited to only those unique ids contained
    +   * in the input dataset.
    +   * @param dataset input Dataset containing id column to user to filter factors.
    +   * @param factors factor DataFrame to filter.
    +   * @param column column name containing the ids in the input dataset.
    +   * @return DataFrame containing factors only for those ids present in both the input dataset and
    +   *         the factor DataFrame.
    +   */
    +  private def getSourceFactorSubset(
    +      dataset: Dataset[_],
    +      factors: DataFrame,
    +      column: String): DataFrame = {
    +    dataset.select(column)
    +      .distinct()
    +      .join(factors, dataset(column) === factors("id"))
    +      .select(factors("id"), factors("features"))
    +  }
    --- End diff --
    
    I think we can use "left semi join" to eliminate the "distinct" here.
    ```
    factors
      .join(dataset.select(column), factors("id") === dataset(column), "left_semi")
      .select(factors("id"), factors("features"))
    ```
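    
    To make the suggestion concrete, here is a minimal sketch with plain Scala collections standing in for the two DataFrames (toy data, hypothetical names, no Spark involved). It compares `distinct` followed by an inner join against a left semi join that has the factors on the left.
    
    ```scala
    // Plain-Scala model of the two join strategies; not Spark code.
    object SemiJoinEquivalenceSketch {
      final case class Factor(id: Int, features: Seq[Int])

      val factors: Seq[Factor] = Seq(
        Factor(0, Seq(0, 1)),
        Factor(1, Seq(1, 2)),
        Factor(2, Seq(2, 3)),
        Factor(3, Seq(3, 4)))

      // Input ids with duplicates.
      val ids: Seq[Int] = Seq(0, 0, 3, 0)

      // Inner join driven by the ids: one output row per matching (id, factor) pair,
      // so duplicate ids produce duplicate factor rows.
      def innerJoin(ids: Seq[Int], factors: Seq[Factor]): Seq[Factor] =
        for (id <- ids; f <- factors if f.id == id) yield f

      // Left semi join with the factors on the left: each factor row is kept
      // at most once, if at least one matching id exists on the right.
      def leftSemi(factors: Seq[Factor], ids: Seq[Int]): Seq[Factor] = {
        val wanted = ids.toSet
        factors.filter(f => wanted.contains(f.id))
      }

      def main(args: Array[String]): Unit = {
        println(innerJoin(ids, factors).map(_.id))          // duplicates survive
        println(innerJoin(ids.distinct, factors).map(_.id)) // distinct first
        println(leftSemi(factors, ids).map(_.id))           // no distinct needed
      }
    }
    ```
    
    In this model the semi join selects the same factor rows as `distinct` plus inner join, which is why the explicit `distinct` becomes unnecessary.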



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #79997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79997/testReport)** for PR 18748 at commit [`5a8c421`](https://github.com/apache/spark/commit/5a8c4216ce636dea3ba67baa9b169db7486f37f2).



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79997/
    Test PASSed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79998/
    Test PASSed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82458/
    Test PASSed.



[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

Posted by WeichenXu123 <gi...@git.apache.org>.

Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18748#discussion_r139161851
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -356,6 +371,40 @@ class ALSModel private[ml] (
       }
     
       /**
    +   * Returns top `numUsers` users recommended for each item id in the input data set. Note that if
    +   * there are duplicate ids in the input dataset, only one set of recommendations per unique id
    +   * will be returned.
    +   * @param dataset a Dataset containing a column of item ids. The column name must match `itemCol`.
    +   * @param numUsers max number of recommendations for each item.
    +   * @return a DataFrame of (itemCol: Int, recommendations), where recommendations are
    +   *         stored as an array of (userCol: Int, rating: Float) Rows.
    +   */
    +  @Since("2.3.0")
    +  def recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame = {
    +    val srcFactorSubset = getSourceFactorSubset(dataset, itemFactors, $(itemCol))
    +    recommendForAll(srcFactorSubset, userFactors, $(itemCol), $(userCol), numUsers)
    +  }
    +
    +  /**
    +   * Returns a subset of a factor DataFrame limited to only those unique ids contained
    +   * in the input dataset.
    +   * @param dataset input Dataset containing id column to user to filter factors.
    +   * @param factors factor DataFrame to filter.
    +   * @param column column name containing the ids in the input dataset.
    +   * @return DataFrame containing factors only for those ids present in both the input dataset and
    +   *         the factor DataFrame.
    +   */
    +  private def getSourceFactorSubset(
    +      dataset: Dataset[_],
    +      factors: DataFrame,
    +      column: String): DataFrame = {
    +    dataset.select(column)
    +      .distinct()
    +      .join(factors, dataset(column) === factors("id"))
    +      .select(factors("id"), factors("features"))
    +  }
    --- End diff --
    
    Oh! But the order of the tables in a left semi join matters.
    
    You should use
    `factors.join(dataset.select("user"), factors("id") === dataset("user"), "left_semi")`
    instead of
    `dataset.select("user").join(factors, dataset("user") === factors("id"), "left_semi")`
    
    The two generate different results: a left semi join returns only rows and columns from its left side, so `factors` must be on the left.
    
    ```
    scala> factors.join(dataset.select("user"), factors("id") === dataset("user"), "left_semi").show
    +---+--------+
    | id|features|
    +---+--------+
    |  0|  [0, 1]|
    |  3|  [3, 4]|
    +---+--------+
    
    ```
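    
    The asymmetry can be sketched with a generic left semi join over plain Scala sequences. This is only a model under stated assumptions: `leftSemi` is a hypothetical helper, and tuples stand in for rows.
    
    ```scala
    // Plain-Scala model of why the side order of a left semi join matters.
    object SemiJoinOrderSketch {
      // Rows of `left` that have at least one match in `right`.
      // Only `left` rows are returned; `right` contributes no columns.
      def leftSemi[L, R](left: Seq[L], right: Seq[R])(matches: (L, R) => Boolean): Seq[L] =
        left.filter(l => right.exists(r => matches(l, r)))

      // (id, features) rows and a user id column that contains duplicates.
      val factors: Seq[(Int, Seq[Int])] =
        Seq(0 -> Seq(0, 1), 1 -> Seq(1, 2), 2 -> Seq(2, 3), 3 -> Seq(3, 4))
      val users: Seq[Int] = Seq(0, 0, 3, 0)

      // Factors on the left: unique factor rows for the requested ids.
      val factorsOnLeft: Seq[(Int, Seq[Int])] =
        leftSemi(factors, users)((f, u) => f._1 == u)

      // Users on the left: the duplicated ids come back and no features are available.
      val usersOnLeft: Seq[Int] =
        leftSemi(users, factors)((u, f) => u == f._1)

      def main(args: Array[String]): Unit = {
        println(factorsOnLeft)
        println(usersOnLeft)
      }
    }
    ```
    
    With the users on the left there is no `features` value left to select, which matches the kind of failure a later `select(factors("id"), factors("features"))` would hit.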




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #82458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82458/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #82456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82456/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    @srowen I'm not really sure whether a `Set` or a `Dataset` would be the more common input. I'm inclined to stick with `Dataset` to keep the API consistent.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82456/
    Test FAILed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #81825 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81825/testReport)** for PR 18748 at commit [`8ed91ab`](https://github.com/apache/spark/commit/8ed91ab283ccaa0b47ebe8467acc186aeca20c54).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18748#discussion_r139164479
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -356,6 +371,40 @@ class ALSModel private[ml] (
       }
     
       /**
    +   * Returns top `numUsers` users recommended for each item id in the input data set. Note that if
    +   * there are duplicate ids in the input dataset, only one set of recommendations per unique id
    +   * will be returned.
    +   * @param dataset a Dataset containing a column of item ids. The column name must match `itemCol`.
    +   * @param numUsers max number of recommendations for each item.
    +   * @return a DataFrame of (itemCol: Int, recommendations), where recommendations are
    +   *         stored as an array of (userCol: Int, rating: Float) Rows.
    +   */
    +  @Since("2.3.0")
    +  def recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame = {
    +    val srcFactorSubset = getSourceFactorSubset(dataset, itemFactors, $(itemCol))
    +    recommendForAll(srcFactorSubset, userFactors, $(itemCol), $(userCol), numUsers)
    +  }
    +
    +  /**
    +   * Returns a subset of a factor DataFrame limited to only those unique ids contained
    +   * in the input dataset.
    +   * @param dataset input Dataset containing id column to user to filter factors.
    +   * @param factors factor DataFrame to filter.
    +   * @param column column name containing the ids in the input dataset.
    +   * @return DataFrame containing factors only for those ids present in both the input dataset and
    +   *         the factor DataFrame.
    +   */
    +  private def getSourceFactorSubset(
    +      dataset: Dataset[_],
    +      factors: DataFrame,
    +      column: String): DataFrame = {
    +    dataset.select(column)
    +      .distinct()
    +      .join(factors, dataset(column) === factors("id"))
    +      .select(factors("id"), factors("features"))
    +  }
    --- End diff --
    
    Ah yeah, ok. Thanks



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Ah ok - that clears things up. Yes that `predict` method is very inefficient relative to the `recommendForAll` setup.
    




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    I don't get similar results to you (granted I have just tested locally). 
    
    ```
    scala> spark.time { userRecsAll.foreach(_ => Unit) }
    Time taken: 122422 ms
    
    scala> spark.time { userRecsPart.foreach(_ => Unit) }
    Time taken: 50228 ms
    ```
    
    Here, `userRecsPart` is a 30% sample, and the time is ~40% of the `recommendForAllUsers` time. I will try some larger-scale tests. It could be that the `join` and `distinct` cause the underperformance.
    
    However, with default settings those operations greatly increase the number of partitions in the computation, because of `spark.sql.shuffle.partitions`. Setting this to, say, `8` (the number of threads I have locally), I get
    
    ```
    scala> spark.time { userRecsPart.foreach(_ => Unit) }
    Time taken: 37362 ms
    ```
    
    So, about 30% of the full time for the 30% sample.
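    
    For reference, the ratios quoted above can be recomputed from the three timings with a few lines of plain Scala (the object and value names are made up for this sketch). In a Spark session, the partition setting itself would typically be changed with `spark.conf.set("spark.sql.shuffle.partitions", "8")`; its default is 200.
    
    ```scala
    // Recomputes the timing ratios from the measurements quoted in this comment.
    object TimingRatioSketch {
      val allUsersMs = 122422.0              // recommendations for all users
      val subsetDefaultMs = 50228.0          // 30% sample, default shuffle partitions
      val subsetEightPartitionsMs = 37362.0  // 30% sample, 8 shuffle partitions

      // Fraction of the full run time taken by a subset run.
      def ratio(partMs: Double, fullMs: Double): Double = partMs / fullMs

      def main(args: Array[String]): Unit = {
        println(f"default partitions: ${ratio(subsetDefaultMs, allUsersMs)}%.3f")
        println(f"8 partitions:       ${ratio(subsetEightPartitionsMs, allUsersMs)}%.3f")
      }
    }
    ```
    
    The two ratios come out at roughly 0.41 and 0.31, in line with the "~40%" and "about 30%" figures.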



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Seems reasonable. Would it be more or less common/natural for someone to specify the users as a simple set rather than a Dataset? Not sure.



[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18748#discussion_r139141803
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -356,6 +371,40 @@ class ALSModel private[ml] (
       }
     
       /**
    +   * Returns top `numUsers` users recommended for each item id in the input data set. Note that if
    +   * there are duplicate ids in the input dataset, only one set of recommendations per unique id
    +   * will be returned.
    +   * @param dataset a Dataset containing a column of item ids. The column name must match `itemCol`.
    +   * @param numUsers max number of recommendations for each item.
    +   * @return a DataFrame of (itemCol: Int, recommendations), where recommendations are
    +   *         stored as an array of (userCol: Int, rating: Float) Rows.
    +   */
    +  @Since("2.3.0")
    +  def recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame = {
    +    val srcFactorSubset = getSourceFactorSubset(dataset, itemFactors, $(itemCol))
    +    recommendForAll(srcFactorSubset, userFactors, $(itemCol), $(userCol), numUsers)
    +  }
    +
    +  /**
    +   * Returns a subset of a factor DataFrame limited to only those unique ids contained
    +   * in the input dataset.
    +   * @param dataset input Dataset containing id column to user to filter factors.
    +   * @param factors factor DataFrame to filter.
    +   * @param column column name containing the ids in the input dataset.
    +   * @return DataFrame containing factors only for those ids present in both the input dataset and
    +   *         the factor DataFrame.
    +   */
    +  private def getSourceFactorSubset(
    +      dataset: Dataset[_],
    +      factors: DataFrame,
    +      column: String): DataFrame = {
    +    dataset.select(column)
    +      .distinct()
    +      .join(factors, dataset(column) === factors("id"))
    +      .select(factors("id"), factors("features"))
    +  }
    --- End diff --
    
    How does that eliminate the need for `distinct`?
    
    e.g. take a look at the below:
    
    ```
    scala> factors.show
    +---+--------+
    | id|features|
    +---+--------+
    |  0|  [0, 1]|
    |  1|  [1, 2]|
    |  2|  [2, 3]|
    |  3|  [3, 4]|
    +---+--------+
    
    
    scala> dataset.show
    +----+
    |user|
    +----+
    |   0|
    |   0|
    |   3|
    |   0|
    +----+
    
    
    scala> dataset.select("user").join(factors, dataset("user") === factors("id"), "left_semi").show
    +----+
    |user|
    +----+
    |   0|
    |   0|
    |   3|
    |   0|
    +----+
    
    scala> dataset.select("user").join(factors, dataset("user") === factors("id"), "left_semi").select(factors("id"), factors("features"))
    org.apache.spark.sql.AnalysisException: resolved attribute(s) id#75,features#76 missing from user#17 in operator !Project [id#75, features#76];;
    !Project [id#75, features#76]
    +- Join LeftSemi, (user#17 = id#75)
       :- Project [user#17]
       :  +- Project [value#15 AS user#17]
       :     +- LocalRelation [value#15]
       +- Project [_1#72 AS id#75, _2#73 AS features#76]
          +- LocalRelation [_1#72, _2#73]
    
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:89)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:276)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:89)
      at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53)
      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:69)
      at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3103)
      at org.apache.spark.sql.Dataset.select(Dataset.scala:1255)
      ... 48 elided
    ```



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by mpjlu <gi...@git.apache.org>.

Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    LGTM



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **Note 1:** This implementation must perform a `distinct` on the id column of the input DataFrame to guarantee correct results. Otherwise multiple "copies" of the same recommendations would be generated for duplicate ids, and the resulting recommendations would contain duplicates. Alternatively, this could be left to the user, on the assumption that the input DataFrame contains no duplicates. But for now I've opted for the safest option even if it introduces this inefficiency.
    
    **Note 2:** This does not support `coldStartStrategy`. Therefore no recommendations are returned for ids in the input DataFrame that are not contained in the model (this is analogous to `coldStartStrategy=drop` for `transform`). I believe this makes the most sense, since supporting something like the `nan` option would be a bit involved and would not add much value. It could be done, however, but would need to return `null` rows in the `recommendations` column for these cases. Later, when other cold start strategies might be supported (e.g. average factor vectors), this method could return recommendations even for ids that are not contained in the model.
    
    cc @srowen @jkbradley @yanboliang @mpjlu @sethah 



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by mpjlu <gi...@git.apache.org>.

Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Thanks.
    This is my test setting:
    3 workers, each: 40 cores, 196G memory, 1 executor.
    Data size: 480,000 users, 17,000 items



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Jenkins retest this please



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    OK, so I ran some larger-scale tests on a cluster (3 workers, each with 48 cores / 100GB allocated RAM and 1 executor).
    
    On the same `movielens-latest` dataset (~250,000 users and ~33,000 movies), using a **30% sample** of user ids:
    
    ```
    scala> // all users
    scala> spark.time { model.recommendForAllUsers(k).foreach(_ => Unit) }
    Time taken: 25104 ms
    
    scala> // user sample
    scala> spark.time { model.recommendForUserSubset(userSample, k).foreach(_ => Unit) }
    Time taken: 8963 ms
    
    scala> 8963 / 25104.0
    res16: Double = 0.35703473550031867
    ```
    
    On a much larger dataset, the Amazon books ratings data (8 million users, 2.3 million items), also using a **30% user sample**:
    
    ```
    scala> // all users
    scala> spark.time { model.recommendForAllUsers(k).foreach(_ => Unit) }
    Time taken: 32985936 ms
    => 9.16 hours
    
    scala> // user sample
    scala> spark.time { model.recommendForUserSubset(userSample, k).foreach(_ => Unit) }
    Time taken: 8164421 ms
    => 2.27 hours
    
    scala> 8164421 / 32985936.0
    res7: Double = 0.24751218216151272
    ```
    
    So the subset call consistently takes about *25-35%* of the recommend-for-all time for a *30%* user sample (I found broadly similar results with a 70% user sample, which took about 60% of the recommend-for-all time).
    
    @mpjlu could you double check your results? What I find is consistent with my expectation that computing for a subset should take time roughly proportional to the ratio of ids in the subset to the total. It appears to me that the extra `distinct` and `join` don't have much impact on overall runtime.
    
    However, your results are very different, so we should understand why.
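
    For reference, the ratios and hour conversions quoted above can be recomputed from the reported timings with a few lines of plain Scala. The object and helper names are hypothetical; the millisecond values are the ones from the transcripts in this comment.

    ```scala
    object SubsetTimingRatios {
      // Fraction of the recommend-for-all time taken by the subset call.
      def ratio(subsetMs: Long, allMs: Long): Double = subsetMs.toDouble / allMs

      // Convert milliseconds to hours.
      def hours(ms: Long): Double = ms / 3600000.0

      def main(args: Array[String]): Unit = {
        // (dataset, subset time in ms, recommend-for-all time in ms)
        val runs = Seq(
          ("movielens-latest", 8963L, 25104L),
          ("Amazon books", 8164421L, 32985936L)
        )
        for ((name, subsetMs, allMs) <- runs) {
          val r = ratio(subsetMs, allMs)
          println(f"$name%-18s subset/all = $r%.3f (${hours(subsetMs)}%.2f h vs ${hours(allMs)}%.2f h)")
        }
      }
    }
    ```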



[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18748



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Any further comments on this? @srowen @mpjlu @jkbradley?



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81825/
    Test PASSed.



[GitHub] spark pull request #18748: [SPARK-20679][ML] Support recommending for a subs...

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18748#discussion_r139139874
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala ---
    @@ -356,6 +371,40 @@ class ALSModel private[ml] (
       }
     
       /**
    +   * Returns top `numUsers` users recommended for each item id in the input data set. Note that if
    +   * there are duplicate ids in the input dataset, only one set of recommendations per unique id
    +   * will be returned.
    +   * @param dataset a Dataset containing a column of item ids. The column name must match `itemCol`.
    +   * @param numUsers max number of recommendations for each item.
    +   * @return a DataFrame of (itemCol: Int, recommendations), where recommendations are
    +   *         stored as an array of (userCol: Int, rating: Float) Rows.
    +   */
    +  @Since("2.3.0")
    +  def recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame = {
    +    val srcFactorSubset = getSourceFactorSubset(dataset, itemFactors, $(itemCol))
    +    recommendForAll(srcFactorSubset, userFactors, $(itemCol), $(userCol), numUsers)
    +  }
    +
    +  /**
    +   * Returns a subset of a factor DataFrame limited to only those unique ids contained
    +   * in the input dataset.
    +   * @param dataset input Dataset containing id column to use to filter factors.
    +   * @param factors factor DataFrame to filter.
    +   * @param column column name containing the ids in the input dataset.
    +   * @return DataFrame containing factors only for those ids present in both the input dataset and
    +   *         the factor DataFrame.
    +   */
    +  private def getSourceFactorSubset(
    +      dataset: Dataset[_],
    +      factors: DataFrame,
    +      column: String): DataFrame = {
    +    dataset.select(column)
    +      .distinct()
    +      .join(factors, dataset(column) === factors("id"))
    +      .select(factors("id"), factors("features"))
    +  }
    --- End diff --
    
    Thanks, will look at using that here.
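
    The overall flow in the diff above (restrict the source factors to the requested ids, then score them against all factors on the other side and keep the top entries) can be sketched in plain Scala. `SubsetTopK`, `recommendForSubset`, and the in-memory `Map` representation are hypothetical; this is a single-machine illustration of the idea, not how Spark computes it on DataFrames.

    ```scala
    object SubsetTopK {
      type Factors = Map[Int, Array[Float]]

      // Predicted score: dot product of two latent factor vectors.
      def dot(a: Array[Float], b: Array[Float]): Float =
        a.zip(b).map { case (x, y) => x * y }.sum

      // For each unique requested id that has source factors, score it against
      // every destination factor and keep the k highest-scoring destination ids.
      def recommendForSubset(
          ids: Seq[Int],
          src: Factors,
          dst: Factors,
          k: Int): Map[Int, Seq[(Int, Float)]] =
        ids.distinct
          .flatMap(id => src.get(id).map(f => (id, f))) // ids unknown to the model are dropped
          .map { case (id, f) =>
            val scored = dst.toSeq.map { case (dstId, g) => (dstId, dot(f, g)) }
            id -> scored.sortWith(_._2 > _._2).take(k)
          }
          .toMap
    }
    ```

    Because only the filtered source factors are scored, the work shrinks with the number of unique requested ids, while each id is still compared against every factor on the destination side.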



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by mpjlu <gi...@git.apache.org>.

Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Thanks @MLnick. I have double-checked my test.
    Since `recommendForUserSubset` did not exist at the time, my previous test used MLlib's `MatrixFactorizationModel.predict(RDD[(Int, Int)])`, which predicts the ratings of many users for many products. That function performs poorly compared with `recommendForAll`.
    Since this PR calls `recommendForAll` with a subset of the users, I agree with your test results. Thanks.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #82458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82458/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by mpjlu <gi...@git.apache.org>.

Github user mpjlu commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Did you test the performance of this? I tested the performance of recommending for a user subset in MLlib some days ago, and it was not good. If `recommendForAll` takes 35s, recommending for 1/3 of the users this way may take 90s, so it may be faster to call `recommendForAll` and then select the 1/3 of users. When recommending for only tens or hundreds of users, however, this is faster than `recommendForAll`. Should we add some comments in the code about when it is better to use `recommendForUserSubset`?
    Thanks.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #81825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81825/testReport)** for PR 18748 at commit [`8ed91ab`](https://github.com/apache/spark/commit/8ed91ab283ccaa0b47ebe8467acc186aeca20c54).



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #79997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79997/testReport)** for PR 18748 at commit [`5a8c421`](https://github.com/apache/spark/commit/5a8c4216ce636dea3ba67baa9b169db7486f37f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18748
  
    **[Test build #82456 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82456/testReport)** for PR 18748 at commit [`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.

