Posted to reviews@spark.apache.org by BenFradet <gi...@git.apache.org> on 2015/12/21 11:25:27 UTC

[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

GitHub user BenFradet opened a pull request:

    https://github.com/apache/spark/pull/10411

    [SPARK-12247] [ML] [DOC] Documentation for spark.ml's ALS and collaborative filtering in general

    This documents the implementation of ALS in `spark.ml`, with example code in Scala, Java and Python.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BenFradet/spark SPARK-12247

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10411.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10411
    
----
commit e0dcf6b08f2e7fa758053241cb7a3e4b6a52630b
Author: BenFradet <be...@gmail.com>
Date:   2015-12-19T14:43:00Z

    typo in ml's als implementation

commit 0f5296c4e86ac97a6cde816fa7402a03a09e6e35
Author: BenFradet <be...@gmail.com>
Date:   2015-12-19T14:43:27Z

    started the doc for ml's als

commit 998dbe3756e9d0eca79773c2cb6671d94a10ce79
Author: BenFradet <be...@gmail.com>
Date:   2015-12-19T14:47:28Z

    typo in mllib-collaborative-filtering doc

commit d0fddbe526950144188d3b9e5b4f93c7bec1a61e
Author: BenFradet <be...@gmail.com>
Date:   2015-12-19T16:11:27Z

    made [scala|java|python] doc references more consistent in mllib-collaborative-filtering

commit f5f3b916bebd684f9961d959bf17d1138089ddb5
Author: BenFradet <be...@gmail.com>
Date:   2015-12-19T16:14:19Z

    cleanup of the examples

commit 44375ebb3a4362a977221fd3e6319bea4b4f95c5
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T13:30:45Z

    ALS example in scala

commit 01a900f3b13da8fa1b488d141e0b41a62bdd64ac
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T14:30:29Z

    added links to the collaborative filtering section

commit 57ab45039b9c250ff20094d8f0035f580bc2de5d
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T14:33:57Z

    added a few comments

commit 0541d76f57345d5d91f65f2d9095aaa017e60099
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T15:08:34Z

    rmd case class

commit 772cef9dc889a7d7f02cd11a9720b5e46782f102
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T15:17:35Z

    rmd dep on file

commit d51a5357ce6bc67e03af5161e54de205a1f6d1d1
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T20:49:07Z

    fix typing issue in the scala example

commit 1be75e6c95ce645bc4d4a429ca323e0f8b678c30
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T20:49:20Z

    java example

commit b7d5491b3408296631761c15c37623d23f3034b5
Author: BenFradet <be...@gmail.com>
Date:   2015-12-20T21:38:05Z

    python example

commit 078736232378f9a98593d7a339b1545c801c6f3c
Author: BenFradet <be...@gmail.com>
Date:   2015-12-21T10:21:48Z

    explanation on implicit feedback

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52829527
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.sql.SQLContext;
    +
    +// $example on$
    +import java.io.Serializable;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.ml.evaluation.RegressionEvaluator;
    +import org.apache.spark.ml.recommendation.ALS;
    +import org.apache.spark.ml.recommendation.ALSModel;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.types.DataTypes;
    +// $example off$
    +
    +public class JavaALSExample {
    +
    +  // $example on$
    +  public static class Rating implements Serializable {
    +    private int userId;
    +    private int movieId;
    +    private float rating;
    +    private long timestamp;
    +
    +    public int getUserId() {
    +      return userId;
    +    }
    +
    +    public void setUserId(int userId) {
    --- End diff --
    
    Aha. Can you omit the setters, leave a no-arg constructor, and leave them non-final? That's JavaBeans-friendly and may be simpler and close enough. If there's any catch though just leave it.
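
For illustration, here is a minimal sketch of the shape suggested above: non-final fields, a no-arg constructor, and getters without setters. The wrapper class `RatingBeanSketch` and the four-argument convenience constructor are assumptions added so the example runs on its own; they are not part of the patch. Whether Spark's bean handling accepts a class without setters is the open "catch" the comment mentions and is not verified here.

```java
import java.io.Serializable;

// Hypothetical wrapper class, used only to make the sketch self-contained.
public class RatingBeanSketch {

  // Bean-style value class: non-final fields, a no-arg constructor, getters only.
  public static class Rating implements Serializable {
    private int userId;
    private int movieId;
    private float rating;
    private long timestamp;

    // No-arg constructor, as the review comment asks to keep.
    public Rating() {}

    // Convenience constructor (an assumption, replaces the removed setters).
    public Rating(int userId, int movieId, float rating, long timestamp) {
      this.userId = userId;
      this.movieId = movieId;
      this.rating = rating;
      this.timestamp = timestamp;
    }

    public int getUserId() {
      return userId;
    }

    public int getMovieId() {
      return movieId;
    }

    public float getRating() {
      return rating;
    }

    public long getTimestamp() {
      return timestamp;
    }
  }

  public static void main(String[] args) {
    Rating r = new Rating(1, 2, 3.5f, 100L);
    System.out.println(r.getUserId() + " " + r.getMovieId() + " "
        + r.getRating() + " " + r.getTimestamp());
  }
}
```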




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52728479
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -0,0 +1,148 @@
    +---
    +layout: global
    +title: Collaborative Filtering - spark.ml
    +displayTitle: Collaborative Filtering - spark.ml
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Collaborative filtering 
    +
    +[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
    +is commonly used for recommender systems.  These techniques aim to fill in the
    +missing entries of a user-item association matrix.  `spark.ml` currently supports
    +model-based collaborative filtering, in which users and products are described
    +by a small set of latent factors that can be used to predict missing entries.
    +`spark.ml` uses the [alternating least squares
    +(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
    +algorithm to learn these latent factors. The implementation in `spark.ml` has the
    +following parameters:
    +
    +* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
    +* *rank* is the number of latent factors in the model (defaults to 10).
    +* *maxIter* is the maximum number of iterations to run (defaults to 10).
    +* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
    +* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
    +  *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
    +* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
    +  *baseline* confidence in preference observations (defaults to 1.0).
    +* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
    +
    +### Explicit vs. implicit feedback
    +
    +The standard approach to matrix factorization based collaborative filtering treats 
    +the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    +
    +It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
    +clicks, purchases, likes, shares etc.). The approach used in `spark.ml` to deal with such data is taken
    +from
    +[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    +Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
    +as a combination of binary preferences and *confidence values*. The ratings are then related to the
    +level of confidence in observed user preferences, rather than explicit ratings given to items.  The
    +model then tries to find latent factors that can be used to predict the expected preference of a
    +user for an item.
    +
    +### Scaling of the regularization parameter
    +
    +We scale the regularization parameter `regParam` in solving each least squares problem by
    +the number of ratings the user generated in updating user factors,
    +or the number of ratings the product received in updating product factors.
    +This approach is named "ALS-WR" and discussed in the paper
    +"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
    +It makes `regParam` less dependent on the scale of the dataset.
    +So we can apply the best parameter learned from a sampled subset to the full dataset
    +and expect similar performance.
    +
    +## Examples
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +In the following example, we load rating data from the
    +[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
    --- End diff --
    
    Do people need to download this now? Which file?




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166845460
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-184335307
  
    **[Test build #51318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51318/consoleFull)** for PR 10411 at commit [`9b351e9`](https://github.com/apache/spark/commit/9b351e914e87012298ab773d6b76ec019a735b6f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-183093903
  
    @srowen @coderxiang Do you have time to review this PR?




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166277975
  
    cc @thunterdb 




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-184320443
  
    **[Test build #51318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51318/consoleFull)** for PR 10411 at commit [`9b351e9`](https://github.com/apache/spark/commit/9b351e914e87012298ab773d6b76ec019a735b6f).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182253960
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166274719
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48109/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182601737
  
    **[Test build #51051 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51051/consoleFull)** for PR 10411 at commit [`9021f36`](https://github.com/apache/spark/commit/9021f36d767b00fb2942e01ac6caba53c1466152).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52727907
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.sql.SQLContext;
    +
    +// $example on$
    +import java.io.Serializable;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.ml.evaluation.RegressionEvaluator;
    +import org.apache.spark.ml.recommendation.ALS;
    +import org.apache.spark.ml.recommendation.ALSModel;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.types.DataTypes;
    +// $example off$
    +
    +public class JavaALSExample {
    +
    +  // $example on$
    +  public static class Rating implements Serializable {
    +    private int userId;
    +    private int movieId;
    +    private float rating;
    +    private long timestamp;
    +
    +    public int getUserId() {
    +      return userId;
    +    }
    +
    +    public void setUserId(int userId) {
    +      this.userId = userId;
    +    }
    +
    +    public int getMovieId() {
    +      return movieId;
    +    }
    +
    +    public void setMovieId(int movieId) {
    +      this.movieId = movieId;
    +    }
    +
    +    public float getRating() {
    +      return rating;
    +    }
    +
    +    public void setRating(float rating) {
    +      this.rating = rating;
    +    }
    +
    +    public long getTimestamp() {
    +      return timestamp;
    +    }
    +
    +    public void setTimestamp(long timestamp) {
    +      this.timestamp = timestamp;
    +    }
    +
    +    public static Rating parseRating(String str) {
    +      String[] fields = str.split("::");
    +      assert(fields.length == 4);
    --- End diff --
    
    You don't want to add `assert`s in Java.
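
One common reason to avoid `assert` for input validation in Java, which may or may not be what the comment has in mind, is that assertions are disabled by default and only run when the JVM is started with `-ea`. The sketch below replaces the assertion with an explicit check that always runs. The class name `ParseRatingSketch`, the helper `splitRating`, and the exception message are hypothetical names chosen for this example; they are not taken from the patch.

```java
// Hypothetical class, used only to make the sketch self-contained.
public class ParseRatingSketch {

  // Splits a "userId::movieId::rating::timestamp" line and validates the
  // field count with an explicit check instead of an assert statement.
  static String[] splitRating(String str) {
    String[] fields = str.split("::");
    if (fields.length != 4) {
      throw new IllegalArgumentException("Each line must contain 4 fields");
    }
    return fields;
  }

  public static void main(String[] args) {
    String[] fields = splitRating("1::2::3.5::978300760");
    System.out.println(fields.length);

    try {
      splitRating("1::2");
    } catch (IllegalArgumentException e) {
      // The malformed line is rejected whether or not -ea is set.
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Unlike an assertion, the check above behaves the same way in every JVM configuration, so a malformed input line fails fast rather than surfacing later as an `ArrayIndexOutOfBoundsException`.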




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166438036
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48130/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182089083
  
    **[Test build #50995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50995/consoleFull)** for PR 10411 at commit [`2603e42`](https://github.com/apache/spark/commit/2603e4281b8a6bb5633e752117a112d3544c892a).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182089675
  
    **[Test build #50995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50995/consoleFull)** for PR 10411 at commit [`2603e42`](https://github.com/apache/spark/commit/2603e4281b8a6bb5633e752117a112d3544c892a).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10411




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52739112
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -0,0 +1,148 @@
    +---
    +layout: global
    +title: Collaborative Filtering - spark.ml
    +displayTitle: Collaborative Filtering - spark.ml
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Collaborative filtering 
    +
    +[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
    +is commonly used for recommender systems.  These techniques aim to fill in the
    +missing entries of a user-item association matrix.  `spark.ml` currently supports
    +model-based collaborative filtering, in which users and products are described
    +by a small set of latent factors that can be used to predict missing entries.
    +`spark.ml` uses the [alternating least squares
    +(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
    +algorithm to learn these latent factors. The implementation in `spark.ml` has the
    +following parameters:
    +
    +* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
    +* *rank* is the number of latent factors in the model (defaults to 10).
    +* *maxIter* is the maximum number of iterations to run (defaults to 10).
    +* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
    +* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
    +  *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
    +* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
    +  *baseline* confidence in preference observations (defaults to 1.0).
    +* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
    +
    +### Explicit vs. implicit feedback
    +
    +The standard approach to matrix factorization based collaborative filtering treats 
    +the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    +
    +It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
    +clicks, purchases, likes, shares etc.). The approach used in `spark.ml` to deal with such data is taken
    +from
    +[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    +Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
    +as a combination of binary preferences and *confidence values*. The ratings are then related to the
    +level of confidence in observed user preferences, rather than explicit ratings given to items.  The
    +model then tries to find latent factors that can be used to predict the expected preference of a
    +user for an item.
    +
    +### Scaling of the regularization parameter
    +
    +We scale the regularization parameter `regParam` in solving each least squares problem by
    +the number of ratings the user generated in updating user factors,
    +or the number of ratings the product received in updating product factors.
    +This approach is named "ALS-WR" and discussed in the paper
    +"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
    +It makes `regParam` less dependent on the scale of the dataset.
    +So we can apply the best parameter learned from a sampled subset to the full dataset
    +and expect similar performance.
    +
    +## Examples
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +In the following example, we load rating data from the
    +[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
    --- End diff --
    
    Nope, it's in the `data` folder; the link is just there to tell people where we got the dataset from.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182601948
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51051/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166552068
  
    **[Test build #48176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48176/consoleFull)** for PR 10411 at commit [`b086ffd`](https://github.com/apache/spark/commit/b086ffd7426437dba1b49a20cdc8386635f20134).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-167126043
  
    **[Test build #48308 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48308/consoleFull)** for PR 10411 at commit [`4176788`](https://github.com/apache/spark/commit/41767888f99dcaf5a9c2deadf5185ad28dfc6b7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166430031
  
    **[Test build #48130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48130/consoleFull)** for PR 10411 at commit [`ab0f301`](https://github.com/apache/spark/commit/ab0f301cc6d9cfa5b4a9f1da733859db52ef7f83).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166437880
  
    **[Test build #48130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48130/consoleFull)** for PR 10411 at commit [`ab0f301`](https://github.com/apache/spark/commit/ab0f301cc6d9cfa5b4a9f1da733859db52ef7f83).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `public class JavaALSExample`
       * `case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)`




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182263500
  
    **[Test build #51032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51032/consoleFull)** for PR 10411 at commit [`2603e42`](https://github.com/apache/spark/commit/2603e4281b8a6bb5633e752117a112d3544c892a).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166557203
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-183717003
  
    **[Test build #51242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51242/consoleFull)** for PR 10411 at commit [`7e72c60`](https://github.com/apache/spark/commit/7e72c60718e59862c0fb8cf0389f8ee93f648990).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166267186
  
    **[Test build #48109 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48109/consoleFull)** for PR 10411 at commit [`0787362`](https://github.com/apache/spark/commit/078736232378f9a98593d7a339b1545c801c6f3c).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-167119997
  
    **[Test build #48308 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48308/consoleFull)** for PR 10411 at commit [`4176788`](https://github.com/apache/spark/commit/41767888f99dcaf5a9c2deadf5185ad28dfc6b7f).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52728386
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -0,0 +1,148 @@
    +---
    +layout: global
    +title: Collaborative Filtering - spark.ml
    +displayTitle: Collaborative Filtering - spark.ml
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Collaborative filtering 
    +
    +[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
    +is commonly used for recommender systems.  These techniques aim to fill in the
    +missing entries of a user-item association matrix.  `spark.ml` currently supports
    +model-based collaborative filtering, in which users and products are described
    +by a small set of latent factors that can be used to predict missing entries.
    +`spark.ml` uses the [alternating least squares
    +(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
    +algorithm to learn these latent factors. The implementation in `spark.ml` has the
    +following parameters:
    +
    +* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
    +* *rank* is the number of latent factors in the model (defaults to 10).
    +* *maxIter* is the maximum number of iterations to run (defaults to 10).
    +* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
    +* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
    +  *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
    +* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
    +  *baseline* confidence in preference observations (defaults to 1.0).
    +* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
    +
    +### Explicit vs. implicit feedback
    +
    +The standard approach to matrix factorization based collaborative filtering treats 
    +the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    +
    +It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
    +clicks, purchases, likes, shares etc.). The approach used in `spark.ml` to deal with such data is taken
    +from
    +[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    +Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
    +as a combination of binary preferences and *confidence values*. The ratings are then related to the
    +level of confidence in observed user preferences, rather than explicit ratings given to items.  The
    +model then tries to find latent factors that can be used to predict the expected preference of a
    +user for an item.
    +
    +### Scaling of the regularization parameter
    +
    +We scale the regularization parameter `regParam` in solving each least squares problem by
    +the number of ratings the user generated in updating user factors,
    +or the number of ratings the product received in updating product factors.
    +This approach is named "ALS-WR" and discussed in the paper
    +"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
    +It makes `regParam` less dependent on the scale of the dataset.
    +So we can apply the best parameter learned from a sampled subset to the full dataset
    --- End diff --
    
    Nit: "... dataset, so that we can ..."
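
For readers of this thread, the scaling described in the quoted paragraph can be illustrated with a minimal sketch. This is not Spark's implementation; the class and method names (`AlsWrScalingSketch`, `effectiveReg`) and the rating counts are hypothetical. It only shows that the penalty applied to a factor grows with the number of ratings behind it, which is why one `regParam` can be reused when moving from a sample to the full dataset.

```java
final class AlsWrScalingSketch {

  // Hypothetical helper: under ALS-WR, the regularization used when solving
  // for one user's (or one product's) factors is regParam multiplied by the
  // number of ratings that user generated (or that product received).
  static double effectiveReg(double regParam, int numRatings) {
    return regParam * numRatings;
  }

  public static void main(String[] args) {
    double regParam = 0.1; // illustrative value, not a recommendation

    // The same regParam yields a proportionally larger penalty for a user
    // with many ratings than for a user with few.
    System.out.println("user with 5 ratings:   " + effectiveReg(regParam, 5));
    System.out.println("user with 500 ratings: " + effectiveReg(regParam, 500));
  }
}
```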




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52727882
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.sql.SQLContext;
    +
    +// $example on$
    +import java.io.Serializable;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.ml.evaluation.RegressionEvaluator;
    +import org.apache.spark.ml.recommendation.ALS;
    +import org.apache.spark.ml.recommendation.ALSModel;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.types.DataTypes;
    +// $example off$
    +
    +public class JavaALSExample {
    +
    +  // $example on$
    +  public static class Rating implements Serializable {
    +    private int userId;
    +    private int movieId;
    +    private float rating;
    +    private long timestamp;
    +
    +    public int getUserId() {
    +      return userId;
    +    }
    +
    +    public void setUserId(int userId) {
    --- End diff --
    
    To keep the example simpler, do you really need setters instead of just constructor args? I personally am used to that as the default, with final fields.
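
As an illustration of this suggestion, a constructor-based version of the class might look like the sketch below. It is an assumption about what was meant, not the code that was eventually merged. The wrapper class `ImmutableRatingSketch` and the sample line in `main` are made up for the example, while the `::` delimiter and the four fields follow the Scala `parseRating` in the same PR. Getters are kept on the assumption that bean-style read access is still wanted.

```java
import java.io.Serializable;

final class ImmutableRatingSketch {

  // Same four fields as the Rating class under review, but final and set
  // once through the constructor, so no setters are needed.
  public static final class Rating implements Serializable {
    private final int userId;
    private final int movieId;
    private final float rating;
    private final long timestamp;

    public Rating(int userId, int movieId, float rating, long timestamp) {
      this.userId = userId;
      this.movieId = movieId;
      this.rating = rating;
      this.timestamp = timestamp;
    }

    public int getUserId() { return userId; }
    public int getMovieId() { return movieId; }
    public float getRating() { return rating; }
    public long getTimestamp() { return timestamp; }

    // Parses one "userId::movieId::rating::timestamp" line.
    public static Rating parseRating(String str) {
      String[] fields = str.split("::");
      if (fields.length != 4) {
        throw new IllegalArgumentException("Each line must contain 4 fields");
      }
      return new Rating(
          Integer.parseInt(fields[0]),
          Integer.parseInt(fields[1]),
          Float.parseFloat(fields[2]),
          Long.parseLong(fields[3]));
    }
  }

  public static void main(String[] args) {
    // Made-up input line, used only to exercise the parser.
    Rating r = Rating.parseRating("1::31::2.5::1260759144");
    System.out.println(r.getUserId() + " rated movie " + r.getMovieId()
        + " with " + r.getRating());
  }
}
```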




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182262594
  
    **[Test build #51032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51032/consoleFull)** for PR 10411 at commit [`2603e42`](https://github.com/apache/spark/commit/2603e4281b8a6bb5633e752117a112d3544c892a).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52830580
  
    --- Diff: docs/mllib-collaborative-filtering.md ---
    @@ -32,16 +32,16 @@ following parameters:
     
     The standard approach to matrix factorization based collaborative filtering treats 
     the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    +For example, users giving ratings to movies.
     
     It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
     clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
    -from
    -[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    -Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
    -as a combination of binary preferences and *confidence values*. The ratings are then related to the
    -level of confidence in observed user preferences, rather than explicit ratings given to items.  The
    -model then tries to find latent factors that can be used to predict the expected preference of a
    -user for an item.
    +from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    +Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
    --- End diff --
    
    @srowen I tried to take your remarks into account, but I'm not sure it's clearer now.
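
One way to make the "binary preferences and confidence values" wording concrete is the formulation from the cited implicit-feedback paper: an observation r becomes a preference of 1 if r > 0 and 0 otherwise, with confidence 1 + alpha * r. The sketch below follows the paper, not necessarily the exact internals of `spark.mllib`; the class name, method names, and the sample view counts are hypothetical.

```java
final class ImplicitFeedbackSketch {

  // Binary preference: 1 if the user interacted with the item at all.
  static int preference(double observation) {
    return observation > 0 ? 1 : 0;
  }

  // Confidence grows linearly with the observation; alpha sets the rate,
  // and unobserved pairs keep a baseline confidence of 1.
  static double confidence(double observation, double alpha) {
    return 1.0 + alpha * observation;
  }

  public static void main(String[] args) {
    double alpha = 1.0; // the default quoted in the parameter list
    double[] views = {0, 1, 20}; // made-up view counts for one user

    for (double v : views) {
      System.out.println("views=" + v
          + " preference=" + preference(v)
          + " confidence=" + confidence(v, alpha));
    }
  }
}
```

Under this reading, 1 view and 20 views both give a preference of 1, and the difference between them shows up only in how much weight the model puts on that preference.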




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166834382
  
    **[Test build #48234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48234/consoleFull)** for PR 10411 at commit [`e336ebd`](https://github.com/apache/spark/commit/e336ebda840ca7b0cf13baaf7de1ca5c2f6abeb9).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-184675125
  
    Merged to master




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-184130751
  
    Great, I'll do that later today.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-183324222
  
    @srowen thanks for the review, will make the necessary changes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-184335470
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51318/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-183717042
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166549963
  
    **[Test build #48174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48174/consoleFull)** for PR 10411 at commit [`3a860b1`](https://github.com/apache/spark/commit/3a860b16206483ea5cba3a309a5855a71adb4304).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166560317
  
    **[Test build #48176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48176/consoleFull)** for PR 10411 at commit [`b086ffd`](https://github.com/apache/spark/commit/b086ffd7426437dba1b49a20cdc8386635f20134).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `public class JavaALSExample`
       * `public static class Rating implements Serializable`
       * `case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)`




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166274716
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52727772
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala ---
    @@ -0,0 +1,82 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +// $example on$
    +import org.apache.spark.ml.evaluation.RegressionEvaluator
    +import org.apache.spark.ml.recommendation.ALS
    +// $example off$
    +import org.apache.spark.sql.SQLContext
    +// $example on$
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.DoubleType
    +// $example off$
    +
    +object ALSExample {
    +
    +  // $example on$
    +  case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    +  object Rating {
    +    def parseRating(str: String): Rating = {
    +      val fields = str.split("::")
    +      assert(fields.size == 4)
    +      Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
    +    }
    +  }
    +  // $example off$
    +
    +  def main(args: Array[String]) {
    +    val conf = new SparkConf().setAppName("ALSExample")
    +    val sc = new SparkContext(conf)
    +    val sqlContext = new SQLContext(sc)
    +    import sqlContext.implicits._
    +
    +    // $example on$
    +    val ratings = sc.textFile("data/mllib/als/sample_movielens_ratings.txt")
    --- End diff --
    
    It looks like this file was removed, though, right? Is it because we can't distribute even a sample of it?




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52728138
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -0,0 +1,148 @@
    +---
    +layout: global
    +title: Collaborative Filtering - spark.ml
    +displayTitle: Collaborative Filtering - spark.ml
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Collaborative filtering 
    +
    +[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
    +is commonly used for recommender systems.  These techniques aim to fill in the
    +missing entries of a user-item association matrix.  `spark.ml` currently supports
    +model-based collaborative filtering, in which users and products are described
    +by a small set of latent factors that can be used to predict missing entries.
    +`spark.ml` uses the [alternating least squares
    +(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
    +algorithm to learn these latent factors. The implementation in `spark.ml` has the
    +following parameters:
    +
    +* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
    +* *rank* is the number of latent factors in the model (defaults to 10).
    +* *maxIter* is the maximum number of iterations to run (defaults to 10).
    +* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
    +* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
    +  *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
    +* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
    +  *baseline* confidence in preference observations (defaults to 1.0).
    +* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
    +
    +### Explicit vs. implicit feedback
    +
    +The standard approach to matrix factorization based collaborative filtering treats 
    +the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    --- End diff --
    
    Worth giving "ratings" as the canonical example of explicit feedback?




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166560438
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166557205
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48174/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166845081
  
    **[Test build #48234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48234/consoleFull)** for PR 10411 at commit [`e336ebd`](https://github.com/apache/spark/commit/e336ebda840ca7b0cf13baaf7de1ca5c2f6abeb9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-183717044
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51242/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166274432
  
    **[Test build #48109 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48109/consoleFull)** for PR 10411 at commit [`0787362`](https://github.com/apache/spark/commit/078736232378f9a98593d7a339b1545c801c6f3c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `public class JavaALSExample`




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52830587
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -0,0 +1,147 @@
    +---
    +layout: global
    +title: Collaborative Filtering - spark.ml
    +displayTitle: Collaborative Filtering - spark.ml
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Collaborative filtering 
    +
    +[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
    +is commonly used for recommender systems.  These techniques aim to fill in the
    +missing entries of a user-item association matrix.  `spark.ml` currently supports
    +model-based collaborative filtering, in which users and products are described
    +by a small set of latent factors that can be used to predict missing entries.
    +`spark.ml` uses the [alternating least squares
    +(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
    +algorithm to learn these latent factors. The implementation in `spark.ml` has the
    +following parameters:
    +
    +* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
    +* *rank* is the number of latent factors in the model (defaults to 10).
    +* *maxIter* is the maximum number of iterations to run (defaults to 10).
    +* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
    +* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
    +  *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
    +* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
    +  *baseline* confidence in preference observations (defaults to 1.0).
    +* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
    +
    +### Explicit vs. implicit feedback
    +
    +The standard approach to matrix factorization based collaborative filtering treats 
    +the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    +For example, users giving ratings to movies.
    +
    +It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views, 
    +clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
    +from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    +Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
    --- End diff --
    
    @srowen I tried to take your remarks into account; I don't know if it's clearer now, though.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52738715
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala ---
    @@ -0,0 +1,82 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +// scalastyle:off println
    +package org.apache.spark.examples.ml
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +// $example on$
    +import org.apache.spark.ml.evaluation.RegressionEvaluator
    +import org.apache.spark.ml.recommendation.ALS
    +// $example off$
    +import org.apache.spark.sql.SQLContext
    +// $example on$
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.DoubleType
    +// $example off$
    +
    +object ALSExample {
    +
    +  // $example on$
    +  case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
    +  object Rating {
    +    def parseRating(str: String): Rating = {
    +      val fields = str.split("::")
    +      assert(fields.size == 4)
    +      Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
    +    }
    +  }
    +  // $example off$
    +
    +  def main(args: Array[String]) {
    +    val conf = new SparkConf().setAppName("ALSExample")
    +    val sc = new SparkContext(conf)
    +    val sqlContext = new SQLContext(sc)
    +    import sqlContext.implicits._
    +
    +    // $example on$
    +    val ratings = sc.textFile("data/mllib/als/sample_movielens_ratings.txt")
    --- End diff --
    
    Nope, the one removed is `sample_movielens_movies.txt`, as it was only used in `MovieLens.scala`, which has been removed; cf. the discussion on the JIRA.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182089686
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50995/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52829428
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.sql.SQLContext;
    +
    +// $example on$
    +import java.io.Serializable;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.ml.evaluation.RegressionEvaluator;
    +import org.apache.spark.ml.recommendation.ALS;
    +import org.apache.spark.ml.recommendation.ALSModel;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.types.DataTypes;
    +// $example off$
    +
    +public class JavaALSExample {
    +
    +  // $example on$
    +  public static class Rating implements Serializable {
    +    private int userId;
    +    private int movieId;
    +    private float rating;
    +    private long timestamp;
    +
    +    public int getUserId() {
    +      return userId;
    +    }
    +
    +    public void setUserId(int userId) {
    --- End diff --
    
    AFAIK, Spark SQL only supports JavaBeans, according to the doc: https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection.
    
    So, public final fields with a constructor won't work.
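
The JavaBean requirement mentioned above can be illustrated without Spark. The following is a minimal sketch, assuming that reflection-based schema inference relies on standard JavaBean introspection (getter/setter pairs) rather than on public fields. `BeanIntrospectionSketch`, `BeanRating`, `FinalFieldRating` and `beanProperties` are hypothetical names used only for illustration; they are not part of the pull request or of Spark.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; none of these classes exist in Spark or in the PR.
class BeanIntrospectionSketch {

  // JavaBean style: private fields exposed through getters and setters.
  public static class BeanRating {
    private int userId;
    private float rating;

    public int getUserId() { return userId; }
    public void setUserId(int userId) { this.userId = userId; }
    public float getRating() { return rating; }
    public void setRating(float rating) { this.rating = rating; }
  }

  // Immutable style: public final fields set by a constructor, no accessors.
  public static class FinalFieldRating {
    public final int userId;
    public final float rating;

    public FinalFieldRating(int userId, float rating) {
      this.userId = userId;
      this.rating = rating;
    }
  }

  // Returns the sorted property names that JavaBean introspection discovers.
  static List<String> beanProperties(Class<?> cls) {
    try {
      // Stop at Object so the inherited "class" property is not reported.
      BeanInfo info = Introspector.getBeanInfo(cls, Object.class);
      List<String> names = new ArrayList<>();
      for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
        names.add(pd.getName());
      }
      Collections.sort(names);
      return names;
    } catch (IntrospectionException e) {
      throw new IllegalStateException("introspection failed for " + cls, e);
    }
  }

  public static void main(String[] args) {
    System.out.println("bean:         " + beanProperties(BeanRating.class));
    System.out.println("final fields: " + beanProperties(FinalFieldRating.class));
  }
}
```

Under that assumption, the bean-style class should report its `rating` and `userId` properties, while the class with public final fields should report none, which is consistent with the point that such a class gives reflection-based inference nothing to build a schema from.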




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182263507
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182263510
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51032/
    Test FAILed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182601947
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52728448
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -0,0 +1,148 @@
    +---
    +layout: global
    +title: Collaborative Filtering - spark.ml
    +displayTitle: Collaborative Filtering - spark.ml
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Collaborative filtering 
    +
    +[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
    +is commonly used for recommender systems.  These techniques aim to fill in the
    +missing entries of a user-item association matrix.  `spark.ml` currently supports
    +model-based collaborative filtering, in which users and products are described
    +by a small set of latent factors that can be used to predict missing entries.
    +`spark.ml` uses the [alternating least squares
    +(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
    +algorithm to learn these latent factors. The implementation in `spark.ml` has the
    +following parameters:
    +
    +* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
    +* *rank* is the number of latent factors in the model (defaults to 10).
    +* *maxIter* is the maximum number of iterations to run (defaults to 10).
    +* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
    +* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
    +  *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
    +* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
    +  *baseline* confidence in preference observations (defaults to 1.0).
    +* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
    +
    +### Explicit vs. implicit feedback
    +
    +The standard approach to matrix factorization based collaborative filtering treats 
    +the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    +
    +It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
    +clicks, purchases, likes, shares etc.). The approach used in `spark.ml` to deal with such data is taken
    +from
    +[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    +Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
    +as a combination of binary preferences and *confidence values*. The ratings are then related to the
    +level of confidence in observed user preferences, rather than explicit ratings given to items.  The
    +model then tries to find latent factors that can be used to predict the expected preference of a
    +user for an item.
    +
    +### Scaling of the regularization parameter
    +
    +We scale the regularization parameter `regParam` in solving each least squares problem by
    +the number of ratings the user generated in updating user factors,
    +or the number of ratings the product received in updating product factors.
    +This approach is named "ALS-WR" and discussed in the paper
    +"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
    +It makes `regParam` less dependent on the scale of the dataset.
    +So we can apply the best parameter learned from a sampled subset to the full dataset
    +and expect similar performance.
    +
    +## Examples
    +
    +<div class="codetabs">
    +<div data-lang="scala" markdown="1">
    +
    +In the following example, we load rating data from the
    +[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
    +consisting of a user, a movie, a rating and a timestamp.
    +We then train an ALS model which assumes, by default, that the ratings are
    +explicit (`implicitPrefs` is `false`).
    +We evaluate the recommendation model by measuring the root-mean-square error of
    +rating prediction.
    +
    +Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.ml.recommendation.ALS)
    +for more details on the API.
    +
    +{% include_example scala/org/apache/spark/examples/ml/ALSExample.scala %}
    +
    +If the rating matrix is derived from another source of information (e.g. it is
    --- End diff --
    
    Nit: you changed e.g. to i.e. below. Either is arguably fine, but keep it consistent.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166845465
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48234/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-167126115
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48308/
    Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r48418593
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.sql.SQLContext;
    +
    +// $example on$
    +import java.io.Serializable;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.ml.evaluation.RegressionEvaluator;
    +import org.apache.spark.ml.recommendation.ALS;
    +import org.apache.spark.ml.recommendation.ALSModel;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.types.DataTypes;
    +// $example off$
    +
    +public class JavaALSExample {
    +
    +  // $example on$
    +  public static class Rating implements Serializable {
    +    private int userId;
    +    private int movieId;
    +    private float rating;
    +    private long timestamp;
    +
    +    public int getUserId() {
    +      return userId;
    +    }
    +
    +    public void setUserId(int userId) {
    +      this.userId = userId;
    +    }
    +
    +    public int getMovieId() {
    +      return movieId;
    +    }
    +
    +    public void setMovieId(int movieId) {
    +      this.movieId = movieId;
    +    }
    +
    +    public float getRating() {
    +      return rating;
    +    }
    +
    +    public void setRating(float rating) {
    +      this.rating = rating;
    +    }
    +
    +    public long getTimestamp() {
    +      return timestamp;
    +    }
    +
    +    public void setTimestamp(long timestamp) {
    +      this.timestamp = timestamp;
    +    }
    +
    +    public static Rating parseRating(String str) {
    +      String[] fields = str.split("::");
    +      assert(fields.length == 4);
    +      Rating rating = new Rating();
    +      rating.setUserId(Integer.parseInt(fields[0]));
    +      rating.setMovieId(Integer.parseInt(fields[1]));
    +      rating.setRating(Float.parseFloat(fields[2]));
    +      rating.setTimestamp(Long.parseLong(fields[3]));
    +      return rating;
    +    }
    +  }
    +  // $example off$
    +
    +  public static void main(String[] args) {
    +    SparkConf conf = new SparkConf().setAppName("JavaALSExample");
    +    JavaSparkContext jsc = new JavaSparkContext(conf);
    +    SQLContext sqlContext = new SQLContext(jsc);
    +
    +    // $example on$
    +    JavaRDD<Rating> ratingsRDD = jsc.textFile("data/mllib/als/sample_movielens_ratings.txt")
    +      .map(new Function<String, Rating>() {
    +        public Rating call(String str) {
    +          return Rating.parseRating(str);
    +        }
    +      });
    +    DataFrame ratings = sqlContext.createDataFrame(ratingsRDD, Rating.class);
    +    DataFrame[] splits = ratings.randomSplit(new double[]{0.8, 0.2});
    +    DataFrame training = splits[0];
    +    DataFrame test = splits[1];
    +
    +    // Build the recommendation model using ALS on the training data
    +    ALS als = new ALS()
    +      .setMaxIter(5)
    +      .setRegParam(0.01)
    +      .setUserCol("userId")
    +      .setItemCol("movieId")
    +      .setRatingCol("rating");
    +    ALSModel model = als.fit(training);
    +
    +    // Evaluate the model by computing the RMSE on the test data
    +    DataFrame rawPredictions = model.transform(test);
    +    DataFrame predictions = rawPredictions
    +      .withColumn("rating", rawPredictions.col("rating").cast(DataTypes.DoubleType))
    +      .withColumn("prediction", rawPredictions.col("prediction").cast(DataTypes.DoubleType));
    --- End diff --
    
    There might be a better way to do this, input welcome.
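
    For context, here is a minimal, Spark-free sketch of what the evaluation step quoted above is working toward: an RMSE over (rating, prediction) pairs, with the `float` values widened to `double` first, which is the plain-Java analogue of the two `cast(DataTypes.DoubleType)` calls. The class name `RmseSketch` and the sample values are made up for illustration; this is not the code from the PR.

```java
public class RmseSketch {

  // Root-mean-square error between observed ratings and predictions.
  // Inputs are float (as in the Rating bean); each value is widened to
  // double before the arithmetic, mirroring the DoubleType casts.
  public static double rmse(float[] ratings, float[] predictions) {
    if (ratings.length != predictions.length || ratings.length == 0) {
      throw new IllegalArgumentException("arrays must be non-empty and of equal length");
    }
    double sumSquaredError = 0.0;
    for (int i = 0; i < ratings.length; i++) {
      double error = (double) ratings[i] - (double) predictions[i];
      sumSquaredError += error * error;
    }
    return Math.sqrt(sumSquaredError / ratings.length);
  }

  public static void main(String[] args) {
    float[] ratings = {4.0f, 3.0f, 5.0f};      // hypothetical observed ratings
    float[] predictions = {3.0f, 3.0f, 4.0f};  // hypothetical model output
    System.out.println("RMSE = " + rmse(ratings, predictions));
  }
}
```

    In the PR itself the same quantity would presumably come from the imported `RegressionEvaluator` rather than a hand-written loop.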


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182587163
  
    **[Test build #51051 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51051/consoleFull)** for PR 10411 at commit [`9021f36`](https://github.com/apache/spark/commit/9021f36d767b00fb2942e01ac6caba53c1466152).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-184335467
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-167126114
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166560440
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48176/




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166557094
  
    **[Test build #48174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48174/consoleFull)** for PR 10411 at commit [`3a860b1`](https://github.com/apache/spark/commit/3a860b16206483ea5cba3a309a5855a71adb4304).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `public class JavaALSExample`
       * `public static class Rating implements Serializable`
       * `case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)`




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-183707910
  
    **[Test build #51242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51242/consoleFull)** for PR 10411 at commit [`7e72c60`](https://github.com/apache/spark/commit/7e72c60718e59862c0fb8cf0389f8ee93f648990).




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52830051
  
    --- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java ---
    @@ -0,0 +1,131 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.ml;
    +
    +import org.apache.spark.SparkConf;
    +import org.apache.spark.api.java.JavaSparkContext;
    +import org.apache.spark.sql.SQLContext;
    +
    +// $example on$
    +import java.io.Serializable;
    +
    +import org.apache.spark.api.java.JavaRDD;
    +import org.apache.spark.api.java.function.Function;
    +import org.apache.spark.ml.evaluation.RegressionEvaluator;
    +import org.apache.spark.ml.recommendation.ALS;
    +import org.apache.spark.ml.recommendation.ALSModel;
    +import org.apache.spark.sql.DataFrame;
    +import org.apache.spark.sql.types.DataTypes;
    +// $example off$
    +
    +public class JavaALSExample {
    +
    +  // $example on$
    +  public static class Rating implements Serializable {
    +    private int userId;
    +    private int movieId;
    +    private float rating;
    +    private long timestamp;
    +
    +    public int getUserId() {
    +      return userId;
    +    }
    +
    +    public void setUserId(int userId) {
    --- End diff --
    
    yup




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-182089684
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-166438034
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-184130158
  
    @BenFradet yeah, I like your last edit. If you're willing to make that change and the sentence-fragment change, I'll merge.




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-183879749
  
    I'm OK merging this




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by BenFradet <gi...@git.apache.org>.
Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/10411#issuecomment-174000491
  
    pinging @thunterdb and @jkbradley 




[GitHub] spark pull request: [SPARK-12247] [ML] [DOC] Documentation for spa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10411#discussion_r52728349
  
    --- Diff: docs/ml-collaborative-filtering.md ---
    @@ -0,0 +1,148 @@
    +---
    +layout: global
    +title: Collaborative Filtering - spark.ml
    +displayTitle: Collaborative Filtering - spark.ml
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Collaborative filtering 
    +
    +[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
    +is commonly used for recommender systems.  These techniques aim to fill in the
    +missing entries of a user-item association matrix.  `spark.ml` currently supports
    +model-based collaborative filtering, in which users and products are described
    +by a small set of latent factors that can be used to predict missing entries.
    +`spark.ml` uses the [alternating least squares
    +(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
    +algorithm to learn these latent factors. The implementation in `spark.ml` has the
    +following parameters:
    +
    +* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
    +* *rank* is the number of latent factors in the model (defaults to 10).
    +* *maxIter* is the maximum number of iterations to run (defaults to 10).
    +* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
    +* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
    +  *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
    +* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
    +  *baseline* confidence in preference observations (defaults to 1.0).
    +* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
    +
    +### Explicit vs. implicit feedback
    +
    +The standard approach to matrix factorization based collaborative filtering treats 
    +the entries in the user-item matrix as *explicit* preferences given by the user to the item.
    +
    +It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
    +clicks, purchases, likes, shares etc.). The approach used in `spark.ml` to deal with such data is taken
    +from
    +[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
    +Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
    +as a combination of binary preferences and *confidence values*. The ratings are then related to the
    --- End diff --
    
    This might just be my own way of wording it, but in implicit data the input is construed as some kind of _strength_ value. It's inherently count-like (e.g. additive), which is how it differs from ratings. The idea of confidence is pretty much an implementation detail. I would not say that "ratings are related to..." anything in this model; there are no rating-like quantities. The model isn't really predicting the strength of a preference, but how likely that preference is to exist.
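
    To make the distinction concrete, here is a rough sketch of how the implicit-feedback paper linked in the quoted doc turns a raw, count-like strength value r into a binary preference (1 if r > 0, else 0) and a confidence weight (1 + alpha * r). The class and method names are hypothetical, and this only illustrates the paper's formulation, not necessarily how `spark.ml` implements it internally.

```java
public class ImplicitFeedbackSketch {

  // Binary preference: does the user appear to like the item at all?
  public static double preference(double strength) {
    return strength > 0.0 ? 1.0 : 0.0;
  }

  // Confidence in that preference grows linearly with the observed strength;
  // unobserved pairs (strength 0) keep a baseline confidence of 1.
  public static double confidence(double strength, double alpha) {
    return 1.0 + alpha * strength;
  }

  public static void main(String[] args) {
    double alpha = 1.0;                      // same value as the documented default
    double[] viewCounts = {0.0, 1.0, 12.0};  // hypothetical view counts
    for (double r : viewCounts) {
      System.out.println("r=" + r + " preference=" + preference(r)
          + " confidence=" + confidence(r, alpha));
    }
  }
}
```

    Twelve views and one view yield the same preference target and differ only in weight, which is why the output reads as how likely a preference is to exist rather than as a predicted rating.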

